<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0">
  <channel>
    <title>Computer Vision AI</title>
    <link>https://honey-vision.tistory.com/</link>
    <description>Hello. I am a master's student studying AI models.</description>
    <language>ko</language>
    <pubDate>Tue, 19 May 2026 06:10:54 +0900</pubDate>
    <generator>TISTORY</generator>
    <ttl>100</ttl>
    <managingEditor>honey-vision</managingEditor>
    <image>
      <title>Computer Vision AI</title>
      <url>https://tistory1.daumcdn.net/tistory/7060573/attach/c4f0f907b2454dc48df8b72edf1346e4</url>
      <link>https://honey-vision.tistory.com</link>
    </image>
    <item>
      <title>[Paper Review] DiffVSR: Revealing an Effective Recipe for Taming Robust Video Super-Resolution Against Complex Degradations</title>
      <link>https://honey-vision.tistory.com/135</link>
      <description>&lt;h2 data-end=&quot;332&quot; data-start=&quot;52&quot; data-ke-size=&quot;size26&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;Abstract&lt;/span&gt;&lt;/h2&gt;
&lt;p data-end=&quot;332&quot; data-start=&quot;52&quot; data-ke-size=&quot;size14&quot;&gt;&lt;b&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;Challenges of diffusion-based VSR&lt;/span&gt;&lt;/b&gt;&lt;/p&gt;
&lt;p data-end=&quot;332&quot; data-start=&quot;52&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;- Maintaining fidelity and temporal consistency&lt;/span&gt;&lt;/p&gt;
&lt;p data-end=&quot;332&quot; data-start=&quot;52&quot; data-ke-size=&quot;size14&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-end=&quot;332&quot; data-start=&quot;52&quot; data-ke-size=&quot;size14&quot;&gt;&lt;b&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;Problems with existing methods&lt;/span&gt;&lt;/b&gt;&lt;/p&gt;
&lt;p data-end=&quot;332&quot; data-start=&quot;52&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;- They struggle to resolve these challenges on severely degraded videos&lt;/span&gt;&lt;/p&gt;
&lt;p data-end=&quot;332&quot; data-start=&quot;52&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;--&amp;gt; In other words, the generative capability of diffusion models fails to carry over precisely where it is needed most&lt;/span&gt;&lt;/p&gt;
&lt;p data-end=&quot;476&quot; data-start=&quot;334&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;- Heavy learning burden and limited high-quality training data&lt;/span&gt;&lt;/p&gt;
&lt;p data-end=&quot;665&quot; data-start=&quot;478&quot; data-ke-size=&quot;size14&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-end=&quot;665&quot; data-start=&quot;478&quot; data-ke-size=&quot;size14&quot;&gt;&lt;b&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;Proposed method&lt;/span&gt;&lt;/b&gt;&lt;/p&gt;
&lt;p data-end=&quot;665&quot; data-start=&quot;478&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;- To improve real-world VSR, the paper &lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000; text-align: start;&quot;&gt;proposes DiffVSR&lt;/span&gt;, which focuses on the training strategy rather than architectural complexity&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;color: #000000; font-family: 'Noto Serif KR';&quot;&gt;Project page: &lt;a style=&quot;color: #000000;&quot; href=&quot;https://xh9998.github.io/DiffVSR-project/&quot; target=&quot;_blank&quot; rel=&quot;noopener&amp;nbsp;noreferrer&quot;&gt;https://xh9998.github.io/DiffVSR-project/&lt;/a&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;887&quot; data-origin-height=&quot;479&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/canFSG/dJMb99Ls1aS/CLzk4xgNBRE9aIsu0qJtg0/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/canFSG/dJMb99Ls1aS/CLzk4xgNBRE9aIsu0qJtg0/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/canFSG/dJMb99Ls1aS/CLzk4xgNBRE9aIsu0qJtg0/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FcanFSG%2FdJMb99Ls1aS%2FCLzk4xgNBRE9aIsu0qJtg0%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;740&quot; height=&quot;400&quot; data-origin-width=&quot;887&quot; data-origin-height=&quot;479&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;hr contenteditable=&quot;false&quot; data-ke-type=&quot;horizontalRule&quot; data-ke-style=&quot;style6&quot; /&gt;
&lt;h2 data-end=&quot;272&quot; data-start=&quot;61&quot; data-ke-size=&quot;size26&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;1.&amp;nbsp;Introduction&lt;/span&gt;&lt;/h2&gt;
&lt;p data-end=&quot;272&quot; data-start=&quot;61&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;VSR (Video Super-Resolution): the task of restoring high-resolution (HR) video from low-resolution (LR) video that has undergone complex degradation&lt;/span&gt;&lt;/p&gt;
&lt;p data-end=&quot;620&quot; data-start=&quot;386&quot; data-ke-size=&quot;size14&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-end=&quot;620&quot; data-start=&quot;386&quot; data-ke-size=&quot;size14&quot;&gt;&lt;b&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;Problems with existing methods&lt;/span&gt;&lt;/b&gt;&lt;/p&gt;
&lt;p data-end=&quot;620&quot; data-start=&quot;386&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;- Results are over-smoothed and look like oil paintings&lt;/span&gt;&lt;/p&gt;
&lt;p data-end=&quot;620&quot; data-start=&quot;386&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;- Complex artifacts are not removed&lt;/span&gt;&lt;/p&gt;
&lt;p data-end=&quot;620&quot; data-start=&quot;386&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;--&amp;gt; These failure cases are exactly where the generative capability of diffusion models is needed most&lt;/span&gt;&lt;/p&gt;
&lt;p data-end=&quot;915&quot; data-start=&quot;748&quot; data-ke-size=&quot;size14&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-end=&quot;915&quot; data-start=&quot;748&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;The paper argues that the main bottleneck is not the architecture design itself but the excessive learning burden placed on the diffusion model&lt;/span&gt;&lt;/p&gt;
&lt;p data-end=&quot;915&quot; data-start=&quot;748&quot; data-ke-size=&quot;size14&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-end=&quot;915&quot; data-start=&quot;748&quot; data-ke-size=&quot;size14&quot;&gt;&lt;b&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;Root causes identified in the paper&lt;/span&gt;&lt;/b&gt;&lt;/p&gt;
&lt;p data-end=&quot;915&quot; data-start=&quot;748&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;- The model must learn the degradation distribution, content representation, temporal relationships, and perceptual quality optimization all at once, which makes training burdensome&lt;/span&gt;&lt;/p&gt;
&lt;p data-end=&quot;915&quot; data-start=&quot;748&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;- High-quality training data is very limited&lt;/span&gt;&lt;/p&gt;
&lt;p data-end=&quot;173&quot; data-start=&quot;45&quot; data-ke-size=&quot;size14&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-end=&quot;351&quot; data-start=&quot;175&quot; data-ke-size=&quot;size14&quot;&gt;&lt;b&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;Core ideas of DiffVSR&lt;/span&gt;&lt;/b&gt;&lt;/p&gt;
&lt;p data-end=&quot;351&quot; data-start=&quot;175&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;- A Progressive Learning Strategy (PLS) that decomposes the learning burden&lt;/span&gt;&lt;/p&gt;
&lt;p data-end=&quot;351&quot; data-start=&quot;175&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;- Interweaved Latent Transition (ILT): merges video segments without additional training or complex alignment&lt;/span&gt;&lt;/p&gt;
&lt;p data-end=&quot;351&quot; data-start=&quot;175&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;- Architectural components such as multi-scale temporal attention and a temporal-enhanced VAE are also included to create synergy with the training strategy&lt;/span&gt;&lt;/p&gt;
&lt;p data-end=&quot;351&quot; data-start=&quot;175&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;- Ablation experiments show that the gain from PLS is far more pronounced on severely degraded videos. &lt;/span&gt;&lt;/p&gt;
&lt;p data-end=&quot;351&quot; data-start=&quot;175&quot; data-ke-size=&quot;size14&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-end=&quot;351&quot; data-start=&quot;175&quot; data-ke-size=&quot;size14&quot;&gt;&lt;u&gt;&lt;b&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;In short, &lt;/span&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;this shows why many existing approaches still perform poorly under severe degradation despite having more complex architectures&lt;/span&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;&lt;/span&gt;&lt;/b&gt;&lt;/u&gt;&lt;/p&gt;
&lt;p data-end=&quot;1191&quot; data-start=&quot;1104&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;color: #ee2323;&quot;&gt;&lt;u&gt;&lt;b&gt; &lt;span style=&quot;font-family: 'Noto Serif KR'; text-align: start;&quot;&gt;How the learning burden is handled may matter more than architectural complexity&lt;/span&gt;&lt;/b&gt;&lt;/u&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p data-end=&quot;1191&quot; data-start=&quot;1104&quot; data-ke-size=&quot;size14&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-end=&quot;57&quot; data-start=&quot;35&quot; data-ke-size=&quot;size14&quot;&gt;&lt;b&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;Contributions&lt;/span&gt;&lt;/b&gt;&lt;/p&gt;
&lt;p data-end=&quot;57&quot; data-start=&quot;35&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;- A staged training scheme &lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000; text-align: start;&quot;&gt;(Progressive Learning Strategy, PLS)&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p data-end=&quot;57&quot; data-start=&quot;35&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000; text-align: start;&quot;&gt;- Interweaved Latent Transition (ILT)&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p data-end=&quot;57&quot; data-start=&quot;35&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000; text-align: start;&quot;&gt;- Extensive experimental evaluation across varying degradation complexity&lt;/span&gt;&lt;/span&gt;&lt;span style=&quot;color: #000000; font-family: 'Noto Serif KR'; font-size: 16px; letter-spacing: 0px;&quot;&gt;&lt;/span&gt;&lt;/p&gt;
&lt;hr contenteditable=&quot;false&quot; data-ke-type=&quot;horizontalRule&quot; data-ke-style=&quot;style6&quot; /&gt;
&lt;h2 data-ke-size=&quot;size26&quot;&gt;&lt;span style=&quot;color: #000000;&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;2.&amp;nbsp;Related&amp;nbsp;Work &lt;/span&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;&lt;/span&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;&lt;/span&gt;&lt;/span&gt;&lt;/h2&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;color: #000000;&quot;&gt;&lt;b&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;Video&amp;nbsp;Super-Resolution.&lt;/span&gt;&lt;/b&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;color: #000000; font-family: 'Noto Serif KR';&quot;&gt;Traditional VSR methods: focused mainly on architecture design to exploit temporal information effectively&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;color: #000000; font-family: 'Noto Serif KR';&quot;&gt;- Methods using deformable convolution: TDAN, EDVR&amp;nbsp;&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;color: #000000;&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;- Recurrent structures: &lt;/span&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; text-align: start;&quot;&gt;BasicVSR, BasicVSR++&amp;nbsp;&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p data-end=&quot;497&quot; data-start=&quot;475&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;color: #000000;&quot;&gt;&lt;b&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;Diffusion&amp;nbsp;Models&amp;nbsp;for&amp;nbsp;Image&amp;nbsp;Restoration.&lt;/span&gt;&lt;/b&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p data-end=&quot;652&quot; data-start=&quot;499&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;color: #000000; font-family: 'Noto Serif KR';&quot;&gt;- Diffusion models have transformed image restoration by providing a generative prior&lt;/span&gt;&lt;/p&gt;
&lt;p data-end=&quot;652&quot; data-start=&quot;499&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;color: #000000; font-family: 'Noto Serif KR';&quot;&gt;- They have demonstrated strong performance even on severely degraded images&lt;/span&gt;&lt;/p&gt;
&lt;p data-end=&quot;809&quot; data-start=&quot;787&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;color: #000000;&quot;&gt;&lt;b&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;Diffusion&amp;nbsp;Models&amp;nbsp;for&amp;nbsp;Video&amp;nbsp;Restoration.&lt;/span&gt;&lt;/b&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p data-end=&quot;809&quot; data-start=&quot;787&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;color: #000000; font-family: 'Noto Serif KR';&quot;&gt;- Recent work has focused on architectural innovations for maintaining temporal consistency&lt;/span&gt;&lt;br /&gt;&lt;span style=&quot;color: #000000;&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;- Upscale-A-Video uses temporal layers and recurrent propagation, and &lt;/span&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;MGLD integrates optical-flow-based guidance&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p data-end=&quot;809&quot; data-start=&quot;787&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;color: #000000;&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;&lt;/span&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;---&amp;gt; However, these methods also remain fragile on severely degraded videos&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p data-end=&quot;809&quot; data-start=&quot;787&quot; data-ke-size=&quot;size14&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-end=&quot;809&quot; data-start=&quot;787&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;color: #000000;&quot;&gt;&lt;u&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;To address this fundamental problem, the paper proposes rethinking &lt;/span&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;&lt;i&gt;'how the model learns'&lt;/i&gt;&lt;/span&gt;&lt;/u&gt;&lt;/span&gt;&lt;/p&gt;
&lt;hr contenteditable=&quot;false&quot; data-ke-type=&quot;horizontalRule&quot; data-ke-style=&quot;style6&quot; /&gt;
&lt;h2 data-ke-size=&quot;size26&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;3.&amp;nbsp;Method&lt;/span&gt;&lt;/h2&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;1189&quot; data-origin-height=&quot;431&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/qlAoi/dJMcafEVwfb/mrrLe2HUGaf4iUQZ2PpvSk/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/qlAoi/dJMcafEVwfb/mrrLe2HUGaf4iUQZ2PpvSk/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/qlAoi/dJMcafEVwfb/mrrLe2HUGaf4iUQZ2PpvSk/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FqlAoi%2FdJMcafEVwfb%2FmrrLe2HUGaf4iUQZ2PpvSk%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;1189&quot; height=&quot;431&quot; data-origin-width=&quot;1189&quot; data-origin-height=&quot;431&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p data-end=&quot;660&quot; data-start=&quot;392&quot; data-ke-size=&quot;size14&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-end=&quot;660&quot; data-start=&quot;392&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;color: #000000;&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;- &lt;/span&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;Progressive Learning Strategy (PLS):&lt;/span&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt; decomposes the learning burden into three aspects and &lt;/span&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;scales them up progressively&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p data-end=&quot;660&quot; data-start=&quot;392&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;color: #000000;&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;- Interweaved Latent Transition (ILT): incurs no &lt;/span&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;additional training cost&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p data-end=&quot;824&quot; data-start=&quot;791&quot; data-ke-size=&quot;size14&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-end=&quot;824&quot; data-start=&quot;791&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;Experiments show that performance gains on complex restoration tasks are driven mainly by &lt;u&gt;training strategies such as PLS&lt;/u&gt;, not by architectural complexity&lt;/span&gt;&lt;/p&gt;
&lt;p data-end=&quot;824&quot; data-start=&quot;791&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;This result challenges the architecture-centric approach emphasized in prior video restoration research and &lt;u&gt;demonstrates that the training strategy is the key to unlocking the capability of diffusion models&lt;/u&gt;&lt;/span&gt;&lt;/p&gt;
&lt;hr contenteditable=&quot;false&quot; data-ke-type=&quot;horizontalRule&quot; data-ke-style=&quot;style6&quot; /&gt;
&lt;h3 data-end=&quot;824&quot; data-start=&quot;791&quot; data-ke-size=&quot;size23&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;3.1.&amp;nbsp;Preliminary:&amp;nbsp;Generative&amp;nbsp;Diffusion&amp;nbsp;Prior&lt;/span&gt;&lt;/h3&gt;
&lt;p data-end=&quot;381&quot; data-start=&quot;150&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;- This work adopts the Stable Diffusion x4 Upscaler as its backbone&lt;/span&gt;&lt;br /&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;- Based on LDM (latent diffusion model)&lt;/span&gt;&lt;br /&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;- Built around an autoencoder (encoder E and decoder D) and a conditional denoising U-Net&lt;/span&gt;&lt;/p&gt;
&lt;p data-end=&quot;381&quot; data-start=&quot;150&quot; data-ke-size=&quot;size14&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-end=&quot;381&quot; data-start=&quot;150&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;During training, Gaussian noise is added to latent samples drawn from the real data distribution according to a noise schedule, producing noisy latents&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;342&quot; data-origin-height=&quot;31&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/tAk3R/dJMcacOYfsl/56TXvSZkm3regF1vA9qUN1/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/tAk3R/dJMcacOYfsl/56TXvSZkm3regF1vA9qUN1/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/tAk3R/dJMcacOYfsl/56TXvSZkm3regF1vA9qUN1/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FtAk3R%2FdJMcacOYfsl%2F56TXvSZkm3regF1vA9qUN1%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;199&quot; height=&quot;18&quot; data-origin-width=&quot;342&quot; data-origin-height=&quot;31&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;Noise corresponding to an early diffusion step τ is also added to the low-resolution input x, which improves the ability to generate fine details&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;172&quot; data-origin-height=&quot;24&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/2mTTU/dJMcaaXVchi/qd9z65t44YL3j0Fbm5Tc6k/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/2mTTU/dJMcaaXVchi/qd9z65t44YL3j0Fbm5Tc6k/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/2mTTU/dJMcaaXVchi/qd9z65t44YL3j0Fbm5Tc6k/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2F2mTTU%2FdJMcaaXVchi%2Fqd9z65t44YL3j0Fbm5Tc6k%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;108&quot; height=&quot;15&quot; data-origin-width=&quot;172&quot; data-origin-height=&quot;24&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;The model is optimized against a target vector using the v-prediction parameterization&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;494&quot; data-origin-height=&quot;43&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/on3UB/dJMcadf2vY7/t2Ja0KLPkHkx8vwwi8lKK1/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/on3UB/dJMcadf2vY7/t2Ja0KLPkHkx8vwwi8lKK1/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/on3UB/dJMcadf2vY7/t2Ja0KLPkHkx8vwwi8lKK1/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2Fon3UB%2FdJMcadf2vY7%2Ft2Ja0KLPkHkx8vwwi8lKK1%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;299&quot; height=&quot;26&quot; data-origin-width=&quot;494&quot; data-origin-height=&quot;43&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p data-end=&quot;1556&quot; data-start=&quot;1443&quot; data-ke-size=&quot;size14&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-end=&quot;1556&quot; data-start=&quot;1443&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;At inference, the model iteratively denoises the latent representation conditioned on the low-resolution input,&lt;/span&gt;&lt;/p&gt;
&lt;p data-end=&quot;1556&quot; data-start=&quot;1443&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;and the sampling process can be flexibly controlled through text prompts and noise scheduling&lt;/span&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;&lt;/span&gt;&lt;/p&gt;
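&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;As a rough illustration of the noising and v-prediction steps above, here is a minimal Python sketch. It assumes the standard variance-preserving form z_t = alpha_t * z_0 + sigma_t * eps with target v = alpha_t * eps - sigma_t * z_0. The cosine schedule and every function name are illustrative assumptions, not taken from the paper or from the Stable Diffusion x4 Upscaler code.&lt;/span&gt;&lt;/p&gt;
&lt;pre&gt;
```python
import math
import random


def cosine_schedule(t):
    """Return (alpha_t, sigma_t) for t in [0, 1]; alpha^2 + sigma^2 = 1 (assumed schedule)."""
    return math.cos(0.5 * math.pi * t), math.sin(0.5 * math.pi * t)


def add_noise(z0, eps, t):
    """Forward process: z_t = alpha_t * z0 + sigma_t * eps."""
    alpha, sigma = cosine_schedule(t)
    return [alpha * z + sigma * e for z, e in zip(z0, eps)]


def v_target(z0, eps, t):
    """v-prediction target: v = alpha_t * eps - sigma_t * z0."""
    alpha, sigma = cosine_schedule(t)
    return [alpha * e - sigma * z for z, e in zip(z0, eps)]


def recover_z0(zt, v, t):
    """Invert the parameterization: z0 = alpha_t * z_t - sigma_t * v."""
    alpha, sigma = cosine_schedule(t)
    return [alpha * a - sigma * b for a, b in zip(zt, v)]


rng = random.Random(0)
z0 = [0.5, -1.0, 2.0]                      # toy clean latent
eps = [rng.gauss(0.0, 1.0) for _ in z0]    # Gaussian noise sample
zt = add_noise(z0, eps, 0.3)               # noisy latent at t = 0.3
v = v_target(z0, eps, 0.3)                 # regression target for the U-Net
z0_hat = recover_z0(zt, v, 0.3)            # clean latent implied by a perfect v
```
&lt;/pre&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;Because alpha^2 + sigma^2 = 1 in this sketch, a perfect prediction of v recovers z_0 exactly, which is what recover_z0 illustrates.&lt;/span&gt;&lt;/p&gt;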
&lt;h3 data-end=&quot;149&quot; data-start=&quot;95&quot; data-ke-size=&quot;size23&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;3.2.&amp;nbsp;Progressive&amp;nbsp;Learning&amp;nbsp;Strategy&lt;/span&gt;&lt;/h3&gt;
&lt;p data-end=&quot;203&quot; data-start=&quot;151&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;The main bottleneck of VSR performance&lt;/span&gt;&lt;/p&gt;
&lt;p data-end=&quot;272&quot; data-start=&quot;207&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;- The excessive learning burden of a diffusion model having to learn the degradation distribution, content representation, and temporal relationships simultaneously.&lt;/span&gt;&lt;/p&gt;
&lt;p data-end=&quot;430&quot; data-start=&quot;274&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;---&amp;gt; The paper proposes PLS, which &lt;span style=&quot;color: #ee2323;&quot;&gt;&lt;b&gt;&lt;u&gt;decomposes training along three dimensions (degradation complexity, dataset quality, parameter optimization)&lt;/u&gt;&lt;/b&gt;&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;659&quot; data-origin-height=&quot;498&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/CWVeZ/dJMcaeeWZFK/UdAdAc4Bi6VIhs90QQ1NZ1/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/CWVeZ/dJMcaeeWZFK/UdAdAc4Bi6VIhs90QQ1NZ1/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/CWVeZ/dJMcaeeWZFK/UdAdAc4Bi6VIhs90QQ1NZ1/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FCWVeZ%2FdJMcaeeWZFK%2FUdAdAc4Bi6VIhs90QQ1NZ1%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;659&quot; height=&quot;498&quot; data-origin-width=&quot;659&quot; data-origin-height=&quot;498&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p data-end=&quot;494&quot; data-start=&quot;437&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;color: #000000;&quot;&gt;&lt;b&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;Stage 1: Temporal Layer Fine-tuning. --&amp;gt; &lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000; text-align: start;&quot;&gt;establish temporal consistency first&lt;/span&gt;&lt;/span&gt;&lt;/b&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p data-end=&quot;494&quot; data-start=&quot;437&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt; - All spatial layers of the pretrained image diffusion model are frozen&lt;/span&gt;&lt;br /&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;- Only the temporal layers are fine-tuned on large-scale data&lt;/span&gt;&lt;br /&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;- Only simple degradations are applied (Gaussian blur, bicubic downsampling, etc.)&lt;/span&gt;&lt;br /&gt;&lt;span style=&quot;color: #000000;&quot;&gt;&lt;b&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;Stage 2: Complex Degradation Adaptation.&lt;/span&gt;&lt;/b&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p data-end=&quot;1188&quot; data-start=&quot;1096&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;- Building on Stage 1, real-world distortions are added progressively&lt;/span&gt;&lt;br /&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;- The dataset scale is kept fixed while degradation complexity increases&lt;/span&gt;&lt;/p&gt;
&lt;p data-end=&quot;1244&quot; data-start=&quot;1195&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;color: #000000;&quot;&gt;&lt;b&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;Stage 3: High-quality Refinement.&lt;/span&gt;&lt;/b&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p data-end=&quot;1244&quot; data-start=&quot;1195&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt; - All parameters are unfrozen&lt;/span&gt;&lt;br /&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;- The entire model is fine-tuned on high-quality video data with complex degradations&lt;/span&gt;&lt;/p&gt;
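&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;The three stages can be summarized as a small configuration table. This is only a sketch: the labels (simple, complex, large_scale, high_quality) are hypothetical, and treating the spatial layers as still frozen in Stage 2 is an assumption inferred from Stage 3 being the point where all parameters are unfrozen.&lt;/span&gt;&lt;/p&gt;
&lt;pre&gt;
```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str
    trainable: tuple     # parameter groups that receive gradients
    degradation: str     # degradation pipeline applied to the training clips
    dataset: str         # which training set is used


# Hypothetical labels; the stages are only described qualitatively in the review.
PLS = (
    Stage("temporal_layer_finetuning", ("temporal",), "simple", "large_scale"),
    # Assumption: spatial layers stay frozen until Stage 3.
    Stage("complex_degradation_adaptation", ("temporal",), "complex", "large_scale"),
    Stage("high_quality_refinement", ("spatial", "temporal"), "complex", "high_quality"),
)


def requires_grad(stage, group):
    """True if the given parameter group is trained in this stage."""
    return group in stage.trainable


def changed_axes(prev, nxt):
    """List which of the three PLS dimensions differ between two stages."""
    axes = ("trainable", "degradation", "dataset")
    return [axis for axis in axes if getattr(prev, axis) != getattr(nxt, axis)]


for prev, nxt in zip(PLS, PLS[1:]):
    print(prev.name, "->", nxt.name, "changes", changed_axes(prev, nxt))
```
&lt;/pre&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;Under these assumptions, each transition changes only part of the setup, which is the sense in which the learning burden is decomposed rather than faced all at once.&lt;/span&gt;&lt;/p&gt;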
&lt;h3 data-end=&quot;125&quot; data-start=&quot;82&quot; data-ke-size=&quot;size23&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;3.3.&amp;nbsp;Interweaved&amp;nbsp;Latent&amp;nbsp;Transition&lt;/span&gt;&lt;/h3&gt;
&lt;p data-end=&quot;281&quot; data-start=&quot;127&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;color: #000000; font-family: 'Noto Serif KR';&quot;&gt;- Problem when processing long videos: boundary inconsistency between segments degrades visual quality&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;1118&quot; data-origin-height=&quot;520&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/lbOBt/dJMcadUEaDS/wF6s7MYoqQdR2HBrWBPo41/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/lbOBt/dJMcadUEaDS/wF6s7MYoqQdR2HBrWBPo41/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/lbOBt/dJMcadUEaDS/wF6s7MYoqQdR2HBrWBPo41/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FlbOBt%2FdJMcadUEaDS%2FwF6s7MYoqQdR2HBrWBPo41%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;531&quot; height=&quot;247&quot; data-origin-width=&quot;1118&quot; data-origin-height=&quot;520&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p data-end=&quot;556&quot; data-start=&quot;457&quot; data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-end=&quot;556&quot; data-start=&quot;457&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;- The full video is decomposed into a set of overlapping subsequences&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;437&quot; data-origin-height=&quot;85&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/bBlU02/dJMcac9g22A/Wvkklmj6wnHMzYvPR0OQw0/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/bBlU02/dJMcac9g22A/Wvkklmj6wnHMzYvPR0OQw0/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/bBlU02/dJMcac9g22A/Wvkklmj6wnHMzYvPR0OQw0/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FbBlU02%2FdJMcac9g22A%2FWvkklmj6wnHMzYvPR0OQw0%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;149&quot; height=&quot;29&quot; data-origin-width=&quot;437&quot; data-origin-height=&quot;85&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;- Each subsequence is processed by the U-Net, producing the corresponding F&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;- Position-based interpolation is applied to the overlapping region between two adjacent subsequences&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;- The interpolated latent within the overlapping region is computed as in Eq. (2)&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;797&quot; data-origin-height=&quot;77&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/k6sZl/dJMcaezfG93/XjGrK6UbTdK6vujCBS6R8K/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/k6sZl/dJMcaezfG93/XjGrK6UbTdK6vujCBS6R8K/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/k6sZl/dJMcaezfG93/XjGrK6UbTdK6vujCBS6R8K/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2Fk6sZl%2FdJMcaezfG93%2FXjGrK6UbTdK6vujCBS6R8K%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;352&quot; height=&quot;34&quot; data-origin-width=&quot;797&quot; data-origin-height=&quot;77&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
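&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;The overlap blending can be sketched roughly as follows. This is only an illustration under stated assumptions: the weight is taken to grow linearly with the frame position inside the overlap (the exact weighting of Eq. (2) is not reproduced here), each frame latent is a flat list of floats, and the names blend_overlap and stitch are hypothetical.&lt;/span&gt;&lt;/p&gt;
&lt;pre&gt;
```python
def blend_overlap(prev_tail, next_head):
    """Blend two equally long lists of per-frame latents over an overlap.

    Assumption (not from the paper): the weight grows linearly with the
    frame position, so early overlap frames follow the previous subsequence
    and late overlap frames follow the next one.
    """
    if len(prev_tail) != len(next_head):
        raise ValueError("overlap segments must have the same length")
    n = len(prev_tail)
    blended = []
    for j, (a, b) in enumerate(zip(prev_tail, next_head)):
        w = (j + 1) / (n + 1)  # hypothetical position-based weight in (0, 1)
        blended.append([(1 - w) * x + w * y for x, y in zip(a, b)])
    return blended


def stitch(subsequences, overlap):
    """Concatenate subsequence latents, blending every overlapping region."""
    if 1 > overlap:
        raise ValueError("overlap must be at least 1")
    out = [list(frame) for frame in subsequences[0]]
    for seq in subsequences[1:]:
        # Replace the tail of the running result with the blended frames.
        out[-overlap:] = blend_overlap(out[-overlap:], seq[:overlap])
        # Frames after the overlap are taken from the new subsequence as-is.
        out.extend(list(frame) for frame in seq[overlap:])
    return out
```
&lt;/pre&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;With two 4-frame subsequences and an overlap of 2, stitch returns 6 frames, and the two middle frames move gradually from the first subsequence to the second.&lt;/span&gt;&lt;/p&gt;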
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;&lt;b&gt;Noise&amp;nbsp;rescheduling.&lt;/b&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;- A noise rescheduling mechanism is integrated to further strengthen temporal coherence&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;color: #000000; font-family: 'Noto Serif KR';&quot;&gt;- According to prior work, the temporal consistency of video diffusion models is affected by both the input content and the initial sampling noise&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;color: #000000; font-family: 'Noto Serif KR';&quot;&gt;- For the first subsequence, random noise frames are generated&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;277&quot; data-origin-height=&quot;49&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/cevf87/dJMcafSsMnz/LI4HjrzfXYjpdnNf0XSapk/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/cevf87/dJMcafSsMnz/LI4HjrzfXYjpdnNf0XSapk/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/cevf87/dJMcafSsMnz/LI4HjrzfXYjpdnNf0XSapk/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2Fcevf87%2FdJMcafSsMnz%2FLI4HjrzfXYjpdnNf0XSapk%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;153&quot; height=&quot;27&quot; data-origin-width=&quot;277&quot; data-origin-height=&quot;49&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;color: #000000; font-family: 'Noto Serif KR';&quot;&gt;- For later &lt;b&gt;&lt;span style=&quot;color: #ee2323;&quot;&gt;subsequences, the noise frames in the overlapping region are &lt;u&gt;reused and reordered&lt;/u&gt;&lt;/span&gt;&lt;/b&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;color: #000000;&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;This &lt;u&gt;synchronizes the diffusion process&lt;/u&gt; across frames,&amp;nbsp;&lt;/span&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;which minimizes temporal jitter,&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;color: #000000;&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;and achieves high temporal consistency without additional model training or computational cost&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
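&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;A minimal sketch of one way such a noise schedule could be built. The notes above only say that overlap frames reuse and reorder noise, so the details here are assumptions: overlap frames copy the noise of the frames they overlap with, and the remaining frames are filled from a shuffled copy of the first subsequence's noise. The function name reschedule_noise and its parameters are hypothetical.&lt;/span&gt;&lt;/p&gt;
&lt;pre&gt;
```python
import random


def reschedule_noise(num_subseq, length, overlap, dim=4, seed=0):
    """Return one list of noise frames per subsequence.

    Assumptions (not taken from the paper): frames in the overlap reuse the
    noise of the frames they overlap with, and the remaining frames reuse a
    shuffled (reordered) copy of the first subsequence's noise.
    """
    if not (length > overlap >= 0):
        raise ValueError("overlap must be non-negative and smaller than length")
    rng = random.Random(seed)
    # Only the first subsequence draws fresh Gaussian noise.
    first = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(length)]
    schedule = [first]
    for _ in range(num_subseq - 1):
        prev = schedule[-1]
        reused = prev[length - overlap:]  # same noise for the shared frames
        pool = first[:]
        rng.shuffle(pool)                 # reorder the initial noise frames
        schedule.append(reused + pool[:length - overlap])
    return schedule
```
&lt;/pre&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;Because no new noise is sampled after the first subsequence, this kind of schedule adds no training and essentially no extra computation.&lt;/span&gt;&lt;/p&gt;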
&lt;h3 data-ke-size=&quot;size23&quot;&gt;&lt;span style=&quot;color: #000000; font-family: 'Noto Serif KR';&quot;&gt;3.4.&amp;nbsp;Architectural&amp;nbsp;Components&lt;/span&gt;&lt;/h3&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;- &lt;span style=&quot;font-family: 'Noto Serif KR'; text-align: start;&quot;&gt;The auxiliary modules&lt;/span&gt; work in synergy with PLS and further improve overall performance&lt;/span&gt;&lt;br /&gt;&lt;span style=&quot;color: #000000;&quot;&gt;&lt;b&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;Multi-Scale&amp;nbsp;Temporal&amp;nbsp;Attention.&lt;/span&gt;&lt;/b&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;- Fuses information across multiple resolution scales&lt;/span&gt;&lt;br /&gt;&lt;span style=&quot;color: #000000;&quot;&gt;&lt;b&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;Temporal-Enhanced&amp;nbsp;VAE.&lt;/span&gt;&lt;/b&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;- Introduces the Temporal-Enhanced 3D VAE (TE-3DVAE), which extends the conventional 2D VAE architecture&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;color: #000000;&quot;&gt;- &lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;Designed by adding 3D residual blocks and &lt;/span&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;temporal attention layers&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;- Trained with a combination of loss terms:&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;color: #000000;&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;L1 reconstruction loss, &lt;/span&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;perceptual loss, &lt;/span&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;adversarial loss (using a temporal PatchGAN discriminator)&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;hr contenteditable=&quot;false&quot; data-ke-type=&quot;horizontalRule&quot; data-ke-style=&quot;style6&quot; /&gt;
&lt;h2 data-ke-size=&quot;size26&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;4.&amp;nbsp;Experiments&lt;/span&gt;&lt;/h2&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;1469&quot; data-origin-height=&quot;878&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/lTnIa/dJMcajm1Fc2/lyQeYMU1WkN7sbTzdwPlQ1/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/lTnIa/dJMcajm1Fc2/lyQeYMU1WkN7sbTzdwPlQ1/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/lTnIa/dJMcajm1Fc2/lyQeYMU1WkN7sbTzdwPlQ1/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FlTnIa%2FdJMcajm1Fc2%2FlyQeYMU1WkN7sbTzdwPlQ1%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;796&quot; height=&quot;476&quot; data-origin-width=&quot;1469&quot; data-origin-height=&quot;878&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;h3 data-ke-size=&quot;size23&quot;&gt;&lt;span style=&quot;color: #000000;&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;4.1. Datasets and Implementation Details&lt;/span&gt;&lt;/span&gt;&lt;/h3&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;color: #000000;&quot;&gt;&lt;b&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;Datasets. &lt;b&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; text-align: start;&quot;&gt;&amp;nbsp;&lt;/span&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; text-align: start;&quot;&gt;&amp;nbsp;&lt;/span&gt;&lt;/span&gt;&lt;/b&gt; &lt;/span&gt;&lt;/b&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;color: #000000;&quot;&gt; &lt;b&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;&lt;b&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; text-align: start;&quot;&gt;▶️ Train Dataset&lt;/span&gt;&lt;/span&gt;&lt;/b&gt;&lt;/span&gt;&lt;/b&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;color: #000000;&quot;&gt;&lt;u&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;Stage 1 and 2 training&lt;/span&gt;&lt;/u&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style=&quot;color: #000000;&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;WebVid-2M (subset): about 400K text-video pairs (&lt;/span&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;336&amp;times;596)&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style=&quot;color: #000000;&quot;&gt;&lt;u&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;Stage 3 fine-tuning&lt;/span&gt;&lt;/u&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style=&quot;color: #000000;&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;OpenVid-1M: &lt;/span&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;about 1M high-resolution video-text pairs (&lt;/span&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;512&amp;times;512 or higher)&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style=&quot;color: #000000;&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;YouHQ: &lt;/span&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;about 37,000 videos at 2K resolution, no text annotations&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;Covers diverse real-world, high-quality footage&lt;/span&gt;&lt;br /&gt;&lt;span style=&quot;color: #000000;&quot;&gt;&lt;u&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;Low-resolution (LR) input generation&lt;/span&gt;&lt;/u&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style=&quot;color: #000000;&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;Generated with the RealBasicVSR degradation pipeline&lt;/span&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;&lt;/span&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt; &lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; text-align: start;&quot;&gt;▶️&lt;/span&gt;&lt;/span&gt;&lt;/span&gt; &lt;b&gt;Test Dataset&lt;/b&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style=&quot;color: #000000;&quot;&gt;&lt;u&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;Synthetic datasets&lt;/span&gt;&lt;/u&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style=&quot;color: #000000;&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;UDM10, &lt;/span&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;YouHQ40&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;color: #000000;&quot;&gt;&lt;u&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;Real-world datasets&lt;/span&gt;&lt;/u&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style=&quot;color: #000000;&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;MVSR4x, &lt;/span&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;RealVideo10 (collected by the authors)&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;br /&gt;&lt;span style=&quot;color: #000000;&quot;&gt;&lt;b&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;Training Details.&lt;/span&gt;&lt;/b&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;Framework: PyTorch&amp;nbsp;&lt;/span&gt;&lt;br /&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;GPU: 8 NVIDIA A100s&amp;nbsp;&lt;/span&gt;&lt;br /&gt;&lt;span style=&quot;color: #000000;&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;Optimizer: &lt;/span&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;AdamW&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;Learning rate: 1e-4&lt;/span&gt;&lt;br /&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;Batch size: 96&lt;/span&gt;&lt;br /&gt;&lt;span style=&quot;color: #000000;&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;Input: &lt;/span&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;320&amp;times;320 patches randomly cropped from 8-frame video segments&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;Temporal&amp;nbsp;stride:&amp;nbsp;varied&amp;nbsp;from&amp;nbsp;1&amp;nbsp;to&amp;nbsp;6&amp;nbsp;frames &lt;/span&gt;&lt;br /&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;&amp;rarr; exposes the model to diverse motion patterns&lt;/span&gt;&lt;br /&gt;&lt;span style=&quot;color: #000000;&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;Upscaling factor: &lt;/span&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;4&amp;times;&amp;nbsp;video&amp;nbsp;super-resolution&amp;nbsp;(4&amp;times;&amp;nbsp;VSR) &lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;&lt;b&gt;Evaluation Metrics.&lt;/b&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;&lt;u&gt;Quantitative Quality&lt;/u&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;PSNR&lt;/span&gt;&lt;br /&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;&lt;u&gt;Perceptual Quality&lt;/u&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;CLIP-IQA, MUSIQ, NRQM, DOVER&lt;/span&gt;&lt;br /&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;&lt;u&gt;Temporal Consistency&lt;/u&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;Warping Error&lt;/span&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;&lt;/span&gt;&lt;/p&gt;
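&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;For reference, PSNR can be computed from the mean squared error as sketched below. This uses the standard definition on flat lists of pixel values and assumes an 8-bit peak of 255; the function name psnr is illustrative and this is not the paper's evaluation code.&lt;/span&gt;&lt;/p&gt;
&lt;pre&gt;
```python
import math


def psnr(reference, restored, max_value=255.0):
    """Peak signal-to-noise ratio (dB) between two flat pixel sequences.

    Assumption: pixels are given as flat sequences of numbers whose peak
    value is max_value (255 for 8-bit images).
    """
    if len(reference) != len(restored) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    # Mean squared error over all pixels.
    mse = sum((r - x) ** 2 for r, x in zip(reference, restored)) / len(reference)
    if mse == 0:
        return math.inf  # identical inputs
    return 10.0 * math.log10(max_value ** 2 / mse)
```
&lt;/pre&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;Higher PSNR means the restored frame is closer to the reference at the pixel level, which is why it is listed separately from the perceptual and temporal metrics.&lt;/span&gt;&lt;/p&gt;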
&lt;h3 data-ke-size=&quot;size23&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;4.2. Ablation Study&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;1454&quot; data-origin-height=&quot;855&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/bYK4ZF/dJMcajgf4Il/ak3AcNCA984zN5vLGkX2Ek/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/bYK4ZF/dJMcajgf4Il/ak3AcNCA984zN5vLGkX2Ek/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/bYK4ZF/dJMcajgf4Il/ak3AcNCA984zN5vLGkX2Ek/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FbYK4ZF%2FdJMcajgf4Il%2Fak3AcNCA984zN5vLGkX2Ek%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;1454&quot; height=&quot;855&quot; data-origin-width=&quot;1454&quot; data-origin-height=&quot;855&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;667&quot; data-origin-height=&quot;373&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/68tR3/dJMcafEVxg7/hfesQ0pkvzn4YfSs8tvcXK/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/68tR3/dJMcafEVxg7/hfesQ0pkvzn4YfSs8tvcXK/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/68tR3/dJMcafEVxg7/hfesQ0pkvzn4YfSs8tvcXK/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2F68tR3%2FdJMcafEVxg7%2FhfesQ0pkvzn4YfSs8tvcXK%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;454&quot; height=&quot;254&quot; data-origin-width=&quot;667&quot; data-origin-height=&quot;373&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;705&quot; data-origin-height=&quot;168&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/Q2b3I/dJMcac9g3BE/cc6UlCKTJDVsl6k0uFk71k/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/Q2b3I/dJMcac9g3BE/cc6UlCKTJDVsl6k0uFk71k/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/Q2b3I/dJMcac9g3BE/cc6UlCKTJDVsl6k0uFk71k/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FQ2b3I%2FdJMcac9g3BE%2Fcc6UlCKTJDVsl6k0uFk71k%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;369&quot; height=&quot;88&quot; data-origin-width=&quot;705&quot; data-origin-height=&quot;168&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;</description>
      <category>Paper Review/Video Super-Resolution</category>
      <author>honey-vision</author>
      <guid isPermaLink="true">https://honey-vision.tistory.com/135</guid>
      <comments>https://honey-vision.tistory.com/135#entry135comment</comments>
      <pubDate>Fri, 7 Nov 2025 12:58:08 +0900</pubDate>
    </item>
    <item>
      <title>[논문리뷰] VSRDiff: Learning Inter-Frame Temporal Coherence in Diffusion Model for Video Super-Resolution</title>
      <link>https://honey-vision.tistory.com/122</link>
      <description>&lt;h2 style=&quot;text-align: justify;&quot; data-end=&quot;190&quot; data-start=&quot;178&quot; data-ke-size=&quot;size26&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;Abstract&lt;/span&gt;&lt;/h2&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;color: #000000;&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;Diffusion models (DMs) have recently been adopted for VSR thanks to their ability to generate detail, but the &lt;/span&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;randomness of diffusion &lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt; makes content control and temporal coherence difficult.&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;color: #000000;&quot;&gt;&lt;b&gt;&lt;u&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;Existing DM-based VSR methods:&lt;/span&gt;&lt;/u&gt;&lt;/b&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt; &lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000; text-align: start;&quot;&gt; &lt;/span&gt;Ignore inter-frame temporal coherence&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;color: #000000;&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt; &lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000; text-align: start;&quot;&gt; &lt;/span&gt;Focus on a purely generative objective rather than a reconstruction-oriented one&lt;/span&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt; &lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000; text-align: start;&quot;&gt; &lt;/span&gt;This leads to visual distortion and temporal inconsistency&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-end=&quot;701&quot; data-start=&quot;667&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;color: #000000;&quot;&gt;&lt;u&gt;&lt;b&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;Proposed method: the VSRDiff framework&lt;/span&gt;&lt;/b&gt;&lt;/u&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;b&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #456771;&quot;&gt;1. IFAG (Inter-Frame Aggregation Guidance) module&lt;/span&gt;&lt;/b&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;b&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #456771;&quot;&gt;2. PRS (Progressive Reconstruction Sampling) strategy&lt;/span&gt;&lt;/b&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;b&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #456771;&quot;&gt;3. FLC (Flow-guided Latent Correction) module&lt;/span&gt;&lt;/b&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-end=&quot;1102&quot; data-start=&quot;1087&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;color: #000000;&quot;&gt;&lt;u&gt;&lt;b&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;Performance evaluation&lt;/span&gt;&lt;/b&gt;&lt;/u&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;color: #000000;&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;On the REDS4 and Vid4 datasets, &lt;/span&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;the method outperforms prior SOTA in both fidelity and temporal consistency&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;hr contenteditable=&quot;false&quot; data-ke-type=&quot;horizontalRule&quot; data-ke-style=&quot;style6&quot; /&gt;
&lt;h2 data-ke-size=&quot;size26&quot;&gt;&lt;span style=&quot;color: #000000; font-family: 'Noto Serif KR';&quot;&gt; I. INTRODUCTION &lt;/span&gt;&lt;/h2&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;u&gt;&lt;b&gt;&lt;span style=&quot;color: #000000;&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;Deep learning-based VSR:&lt;/span&gt;&lt;/span&gt;&lt;/b&gt;&lt;/u&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;color: #000000;&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt; &lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000; text-align: start;&quot;&gt; &lt;/span&gt;Pixel-level reconstruction works, but &lt;/span&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;perceptual quality is lacking&lt;/span&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt; &lt;/span&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;and realistic textures and details are not reproduced.&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-pm-slice=&quot;0 0 []&quot; data-ke-size=&quot;size14&quot;&gt;&lt;b&gt;&lt;u&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;Recent work has moved toward diffusion model-based VSR&lt;/span&gt;&lt;/u&gt;&lt;/b&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt; &lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000; text-align: start;&quot;&gt; &lt;/span&gt;Because generation relies on randomness, &lt;/span&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;each frame is likely to be generated differently&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt; &lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000; text-align: start;&quot;&gt; &lt;/span&gt;This causes visual distortion, &lt;/span&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;inter-frame flickering, or inconsistency&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;b&gt;&lt;u&gt;&lt;span style=&quot;color: #000000; font-family: 'Noto Serif KR';&quot;&gt;Prior work&lt;/span&gt;&lt;/u&gt;&lt;/b&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;color: #000000; font-family: 'Noto Serif KR';&quot;&gt; MGLD, &lt;/span&gt;&lt;span style=&quot;color: #000000; font-family: 'Noto Serif KR';&quot;&gt;SATeCo, &lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000; text-align: start;&quot;&gt;Upscale-A-Video&amp;nbsp;&lt;/span&gt; &lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;color: #000000; font-family: 'Noto Serif KR';&quot;&gt; &lt;/span&gt;&lt;span style=&quot;color: #000000; font-family: 'Noto Serif KR';&quot;&gt;Insufficient inter-frame coherence, &lt;/span&gt;&lt;span style=&quot;color: #000000; font-family: 'Noto Serif KR';&quot;&gt;no reconstruction-oriented design&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;1239&quot; data-origin-height=&quot;840&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/bP4DTJ/btsPf3mLUet/ON9068eDdGykKWslmqP6G1/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/bP4DTJ/btsPf3mLUet/ON9068eDdGykKWslmqP6G1/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/bP4DTJ/btsPf3mLUet/ON9068eDdGykKWslmqP6G1/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FbP4DTJ%2FbtsPf3mLUet%2FON9068eDdGykKWslmqP6G1%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;1239&quot; height=&quot;840&quot; data-origin-width=&quot;1239&quot; data-origin-height=&quot;840&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;hr contenteditable=&quot;false&quot; data-ke-type=&quot;horizontalRule&quot; data-ke-style=&quot;style6&quot; /&gt;
&lt;h2 data-ke-size=&quot;size26&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt; II. RELATED WORK &lt;/span&gt;&lt;/h2&gt;
&lt;h3 data-ke-size=&quot;size23&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt; A. VIDEO SUPER-RESOLUTION &lt;/span&gt;&lt;/h3&gt;
&lt;p style=&quot;text-align: justify;&quot; data-end=&quot;179&quot; data-start=&quot;138&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;color: #000000;&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;Deep learning-based methods fall into three categories.&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-end=&quot;179&quot; data-start=&quot;138&quot; data-ke-size=&quot;size14&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-end=&quot;179&quot; data-start=&quot;138&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;color: #000000;&quot;&gt;&lt;b&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt; &lt;/span&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;Sliding Window-based Methods&lt;/span&gt;&lt;/b&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-end=&quot;179&quot; data-start=&quot;138&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;color: #000000;&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;These methods group adjacent frames with a sliding window &lt;/span&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;to reconstruct each frame,&lt;/span&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; font-size: 16px; letter-spacing: 0px;&quot;&gt;&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-end=&quot;179&quot; data-start=&quot;138&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;and some additionally apply motion estimation or motion compensation.&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-end=&quot;179&quot; data-start=&quot;138&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;e.g., LGFN&lt;/span&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-end=&quot;179&quot; data-start=&quot;138&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;color: #1a5490;&quot;&gt;&lt;u&gt;&lt;b&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;Cannot learn long-range dependencies&lt;/span&gt;&lt;/b&gt;&lt;/u&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-end=&quot;179&quot; data-start=&quot;138&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;color: #1a5490;&quot;&gt;&lt;u&gt;&lt;b&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;Cannot make use of previously predicted information.&lt;/span&gt;&lt;/b&gt;&lt;/u&gt;&lt;/span&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;&lt;/span&gt;&lt;/p&gt;
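&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;A small sketch of how a sliding window limits what each frame can see. It assumes a symmetric window of a given radius that is clamped at the sequence boundaries; the function name and the clamping choice are illustrative and not taken from any specific method.&lt;/span&gt;&lt;/p&gt;
&lt;pre&gt;
```python
def sliding_windows(num_frames, radius):
    """For each target frame, list the frame indices inside its window.

    Assumption: the window is symmetric around the target frame and is
    clamped at the start and end of the sequence (a hypothetical choice).
    """
    windows = []
    for t in range(num_frames):
        start = max(0, t - radius)
        stop = min(num_frames, t + radius + 1)
        windows.append(list(range(start, stop)))
    return windows
```
&lt;/pre&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;Each target frame only sees at most 2 * radius + 1 neighbors, so information from frames outside the window never reaches it.&lt;/span&gt;&lt;/p&gt;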
&lt;p style=&quot;text-align: justify;&quot; data-end=&quot;593&quot; data-start=&quot;557&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;color: #000000;&quot;&gt;&lt;b&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt; &lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt; &lt;/span&gt;Recurrent-based Methods&lt;/span&gt;&lt;/b&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-end=&quot;593&quot; data-start=&quot;557&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;color: #000000;&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;These methods process frames sequentially in temporal order and use past states to reconstruct the next frame.&lt;/span&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-end=&quot;593&quot; data-start=&quot;557&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;color: #000000;&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;Compared with sliding-window methods, they can learn long-term dependencies &lt;/span&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;because they accumulate and reuse information from past frames.&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-end=&quot;593&quot; data-start=&quot;557&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;color: #000000;&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; font-size: 16px; letter-spacing: 0px;&quot;&gt;e.g., &lt;/span&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; font-size: 16px; letter-spacing: 0px;&quot;&gt;BasicVSR&lt;/span&gt;&lt;/span&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p data-pm-slice=&quot;0 0 []&quot; data-ke-size=&quot;size14&quot;&gt;&lt;u&gt;&lt;b&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #1a5490;&quot;&gt;Long-range dependencies are still limited on long sequences&lt;/span&gt;&lt;/b&gt;&lt;/u&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;u&gt;&lt;b&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #1a5490;&quot;&gt;Higher computational cost; hard to parallelize&lt;/span&gt;&lt;/b&gt;&lt;/u&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;color: #000000;&quot;&gt; &lt;b&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt; Transformer-based Methods&lt;/span&gt;&lt;/span&gt;&lt;/b&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;These methods can learn global dependencies and integrate spatio-temporal information well,&amp;nbsp;&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;but all of these deep learning-based methods focus on pixel-level reconstruction,&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;b&gt;&lt;span style=&quot;color: #1a5490;&quot;&gt;&lt;u&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;so they share the problem of overlooking perceptual quality.&lt;/span&gt;&lt;/u&gt;&lt;/span&gt;&lt;/b&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;To address these issues, GAN-based methods such as SUPERVEGAN have been proposed.&lt;/span&gt;&lt;/p&gt;
&lt;h3 data-ke-size=&quot;size23&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt; B. DIFFUSION MODELS&lt;/span&gt;&lt;/h3&gt;
&lt;p style=&quot;text-align: justify;&quot; data-end=&quot;211&quot; data-start=&quot;167&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt; &lt;b&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt; DMs&lt;/span&gt;&lt;/span&gt;&lt;/b&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-end=&quot;211&quot; data-start=&quot;167&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;generative power&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;color: #000000;&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;Very high computational cost &lt;/span&gt;&lt;/span&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;&lt;/span&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-end=&quot;424&quot; data-start=&quot;375&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;color: #000000;&quot;&gt;&lt;b&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt; &lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt; &lt;/span&gt;&lt;/span&gt;LDMs&lt;/span&gt;&lt;/b&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-end=&quot;424&quot; data-start=&quot;375&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;Performs diffusion in a (low-dimensional) latent space&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-end=&quot;424&quot; data-start=&quot;375&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;Greatly reduces computational cost&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;color: #000000;&quot;&gt;&lt;b&gt; &lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt; &lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;Stable Diffusion &lt;/span&gt;&lt;/b&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;A representative LDM-based model&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;color: #000000;&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;Pre-trained on a large-scale dataset (LAION-5B)&lt;/span&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;Carries broad prior knowledge&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;Handles not only image generation but also various downstream tasks such as image editing and super-resolution&lt;/span&gt;&lt;/p&gt;
&lt;h3 data-ke-size=&quot;size23&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt; C. DIFFUSION MODELS FOR VSR&lt;/span&gt;&lt;/h3&gt;
&lt;p style=&quot;text-align: justify;&quot; data-end=&quot;219&quot; data-start=&quot;159&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;color: #000000;&quot;&gt;&lt;b&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt; &lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt; &lt;/span&gt;&lt;/span&gt;&lt;/span&gt;Prior work: Diffusion &amp;rarr; ISR (Image Super-Resolution)&lt;/span&gt;&lt;/b&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-end=&quot;219&quot; data-start=&quot;159&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;color: #000000;&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;Uses the generative priors of diffusion models to &lt;/span&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;restore high-quality images&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-end=&quot;219&quot; data-start=&quot;159&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;color: #000000;&quot;&gt;&lt;b&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt; &lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt; &lt;/span&gt;&lt;/span&gt;&lt;/span&gt;Extension to VSR&lt;/span&gt;&lt;/b&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p data-pm-slice=&quot;0 0 []&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;When these models are applied to video, which adds a time axis on top of still images, visual distortion and temporal inconsistency arise&lt;/span&gt;&lt;/p&gt;
&lt;p data-pm-slice=&quot;0 0 []&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;e.g. &lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;MGLD,&amp;nbsp;&lt;/span&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;SATeCo,&amp;nbsp;&lt;span style=&quot;font-family: 'Noto Serif KR'; text-align: start;&quot;&gt;Upscale-A-Video&amp;nbsp;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p data-pm-slice=&quot;0 0 []&quot; data-ke-size=&quot;size14&quot;&gt;&lt;b&gt;&lt;span style=&quot;color: #1a5490;&quot;&gt;&lt;u&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;Lack of inter-frame coherence&lt;/span&gt;&lt;/u&gt;&lt;/span&gt;&lt;/b&gt;&lt;/p&gt;
&lt;p data-pm-slice=&quot;0 0 []&quot; data-ke-size=&quot;size14&quot;&gt;&lt;b&gt;&lt;span style=&quot;color: #1a5490;&quot;&gt;&lt;u&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;No reconstruction-oriented objective&lt;/span&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;&lt;/span&gt;&lt;/u&gt;&lt;/span&gt;&lt;/b&gt;&lt;/p&gt;
&lt;p data-pm-slice=&quot;0 0 []&quot; data-ke-size=&quot;size14&quot;&gt;&lt;b&gt;&lt;span style=&quot;color: #1a5490;&quot;&gt;&lt;u&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;Overly generation-driven restoration &amp;rarr; lower fidelity to the original&lt;/span&gt;&lt;/u&gt;&lt;/span&gt;&lt;/b&gt;&lt;/p&gt;
&lt;p data-pm-slice=&quot;0 0 []&quot; data-ke-size=&quot;size14&quot;&gt;&lt;b&gt;&lt;span style=&quot;color: #1a5490;&quot;&gt;&lt;u&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;Negative impact on &lt;span style=&quot;font-family: 'Noto Serif KR'; text-align: start;&quot;&gt;temporal consistency&lt;/span&gt; &lt;/span&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;due to randomness&lt;/span&gt;&lt;/u&gt;&lt;/span&gt;&lt;/b&gt;&lt;/p&gt;
&lt;hr contenteditable=&quot;false&quot; data-ke-type=&quot;horizontalRule&quot; data-ke-style=&quot;style6&quot; /&gt;
&lt;h2 data-ke-size=&quot;size26&quot;&gt;&lt;span style=&quot;color: #000000;&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt; III. OUR APPROACH&lt;/span&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;&lt;/span&gt;&lt;/span&gt;&lt;/h2&gt;
&lt;p style=&quot;text-align: justify;&quot; data-end=&quot;290&quot; data-start=&quot;274&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;Two goals&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;1. Visual Fidelity&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;2. Temporal Consistency&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;1458&quot; data-origin-height=&quot;816&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/bp3Wup/btsPfiZmlD8/WEE0AS98rXwu1HYdld1lYk/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/bp3Wup/btsPfiZmlD8/WEE0AS98rXwu1HYdld1lYk/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/bp3Wup/btsPfiZmlD8/WEE0AS98rXwu1HYdld1lYk/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2Fbp3Wup%2FbtsPfiZmlD8%2FWEE0AS98rXwu1HYdld1lYk%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;1458&quot; height=&quot;816&quot; data-origin-width=&quot;1458&quot; data-origin-height=&quot;816&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;h3 style=&quot;text-align: justify;&quot; data-ke-size=&quot;size23&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;A. PRELIMINARY: DIFFUSION MODELS&lt;/span&gt;&lt;/h3&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;color: #000000; font-family: 'Noto Serif KR';&quot;&gt;See the DDPM paper for the diffusion equations.&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;color: #000000; font-family: 'Noto Serif KR';&quot;&gt; &lt;span style=&quot;text-align: start;&quot;&gt; &lt;/span&gt;Forward process&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;497&quot; data-origin-height=&quot;55&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/TQByS/btsPhkuaXin/PyDF4VXjkCWBJU7yhH14u0/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/TQByS/btsPhkuaXin/PyDF4VXjkCWBJU7yhH14u0/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/TQByS/btsPhkuaXin/PyDF4VXjkCWBJU7yhH14u0/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FTQByS%2FbtsPhkuaXin%2FPyDF4VXjkCWBJU7yhH14u0%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;317&quot; height=&quot;35&quot; data-origin-width=&quot;497&quot; data-origin-height=&quot;55&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;color: #000000; font-family: 'Noto Serif KR';&quot;&gt;&lt;span style=&quot;text-align: start;&quot;&gt; Reverse&lt;/span&gt;&lt;span style=&quot;text-align: start;&quot;&gt; process&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;526&quot; data-origin-height=&quot;43&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/dp6DGp/btsPf2heJle/Uqc9PppDPDp3Q1Ugpc6CK0/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/dp6DGp/btsPf2heJle/Uqc9PppDPDp3Q1Ugpc6CK0/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/dp6DGp/btsPf2heJle/Uqc9PppDPDp3Q1Ugpc6CK0/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2Fdp6DGp%2FbtsPf2heJle%2FUqc9PppDPDp3Q1Ugpc6CK0%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;318&quot; height=&quot;26&quot; data-origin-width=&quot;526&quot; data-origin-height=&quot;43&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;color: #000000; font-family: 'Noto Serif KR';&quot;&gt;&lt;span style=&quot;text-align: start;&quot;&gt; Loss&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;548&quot; data-origin-height=&quot;51&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/B11HZ/btsPgBpFngf/qvywmK8iNGvJsaONoBiRzk/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/B11HZ/btsPgBpFngf/qvywmK8iNGvJsaONoBiRzk/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/B11HZ/btsPgBpFngf/qvywmK8iNGvJsaONoBiRzk/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FB11HZ%2FbtsPgBpFngf%2FqvywmK8iNGvJsaONoBiRzk%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;366&quot; height=&quot;34&quot; data-origin-width=&quot;548&quot; data-origin-height=&quot;51&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;color: #000000; font-family: 'Noto Serif KR';&quot;&gt;&lt;span style=&quot;text-align: start;&quot;&gt; Sampling&lt;/span&gt;&lt;span style=&quot;text-align: start;&quot;&gt; process&amp;nbsp;&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;615&quot; data-origin-height=&quot;92&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/otcdH/btsPfynm8oH/pvvfrxOqtxVWvhLeA7IZ3k/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/otcdH/btsPfynm8oH/pvvfrxOqtxVWvhLeA7IZ3k/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/otcdH/btsPfynm8oH/pvvfrxOqtxVWvhLeA7IZ3k/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FotcdH%2FbtsPfynm8oH%2FpvvfrxOqtxVWvhLeA7IZ3k%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;341&quot; height=&quot;51&quot; data-origin-width=&quot;615&quot; data-origin-height=&quot;92&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
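&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;color: #000000; font-family: 'Noto Serif KR';&quot;&gt;As a rough illustration of the forward process and loss above, here is a minimal pure-Python sketch of the standard DDPM formulation. The linear beta schedule, the function names (linear_betas, alpha_bars, q_sample, noise_mse), and the toy list-based latent are assumptions for illustration and are not taken from the DiffVSR paper.&lt;/span&gt;&lt;/p&gt;

```python
import math
import random


def linear_betas(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linear noise schedule (the DDPM default; assumed here)."""
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def alpha_bars(betas):
    """Cumulative products of (1 - beta_t)."""
    out, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        out.append(running)
    return out


def q_sample(x0, t, abar, rng):
    """Draw x_t from q(x_t | x_0) = N(sqrt(abar_t) * x_0, (1 - abar_t) * I)."""
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    a = math.sqrt(abar[t])
    b = math.sqrt(1.0 - abar[t])
    xt = [a * x + b * e for x, e in zip(x0, eps)]
    return xt, eps


def noise_mse(eps_true, eps_pred):
    """Simple DDPM loss: mean squared error between true and predicted noise."""
    return sum((p - q) ** 2 for p, q in zip(eps_true, eps_pred)) / len(eps_true)
```

&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;color: #000000; font-family: 'Noto Serif KR';&quot;&gt;For example, q_sample(x0, t, alpha_bars(linear_betas(1000)), random.Random(0)) returns a noisier x_t as t grows, because abar[t] shrinks toward zero. In training, the eps_pred passed to noise_mse would come from the denoising network.&lt;/span&gt;&lt;/p&gt;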
&lt;h3 data-ke-size=&quot;size23&quot;&gt;&lt;span style=&quot;color: #000000; font-family: 'Noto Serif KR';&quot;&gt; B. INTER-FRAME AGGREGATION GUIDANCE&lt;/span&gt;&lt;/h3&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;color: #000000;&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;&lt;span style=&quot;text-align: start;&quot;&gt;Considers the relationship with neighboring frames and provides a condition so that the denoising &lt;/span&gt;&lt;/span&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;&lt;span style=&quot;text-align: start;&quot;&gt;U-Net can generate &lt;/span&gt;more accurate HR frames.&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;1451&quot; data-origin-height=&quot;596&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/wxfe8/btsPgeaELDs/PdevRVdMiFuEkdVCwkuhF1/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/wxfe8/btsPgeaELDs/PdevRVdMiFuEkdVCwkuhF1/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/wxfe8/btsPgeaELDs/PdevRVdMiFuEkdVCwkuhF1/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2Fwxfe8%2FbtsPgeaELDs%2FPdevRVdMiFuEkdVCwkuhF1%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;1451&quot; height=&quot;596&quot; data-origin-width=&quot;1451&quot; data-origin-height=&quot;596&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;color: #000000;&quot;&gt;&lt;b&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;1) AGGREGATION ENCODER&lt;/span&gt;&lt;/b&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt; &lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000; text-align: start;&quot;&gt; &lt;/span&gt;First, the LR sequence is passed through the VAE and then fed into the aggregation encoder (together with the time information).&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt; &lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000; text-align: start;&quot;&gt; Intermediate&amp;nbsp;feature&amp;nbsp;maps&lt;/span&gt; are extracted.&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt; &lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000; text-align: start;&quot;&gt; &lt;/span&gt;The encoder consists of multi-scale convolutional blocks with the same scale structure as the U-Net.&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt; &lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000; text-align: start;&quot;&gt; &lt;/span&gt;The initial input is added to the features that have passed through the network in the dashed box, and the sum serves as the condition.&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;color: #000000;&quot;&gt;&lt;b&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; text-align: start;&quot;&gt;2) IFAG MODULE&lt;/span&gt; &lt;/b&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;color: #000000; font-family: 'Noto Serif KR';&quot;&gt;&lt;span style=&quot;text-align: start;&quot;&gt;Inter-Frame Aggregation Guidance (IFAG) refers to the entire module shown in figure (a).&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;color: #000000; font-family: 'Noto Serif KR';&quot;&gt;&lt;span style=&quot;text-align: start;&quot;&gt;The features from the aggregation encoder pass through SFT,&lt;/span&gt;&lt;/span&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt; and the SFT modulation result is added to the original features.&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;SFT stands for Spatial Feature Transform; the module produces the affine parameters used to modulate the features.&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;b&gt; &lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000; text-align: start;&quot;&gt; &lt;/span&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #333333; text-align: start;&quot;&gt;SFT module&lt;/span&gt; &lt;/b&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;The input is passed through two separate networks, one producing alpha and the other producing gamma.&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;The affine parameters are the parameters that control the affine transformation.&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;Here the affine transformation is an element-wise scale and shift applied to the feature values, not a geometric rearrangement of pixel positions.&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;Alpha controls scaling and gamma controls shifting.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;321&quot; data-origin-height=&quot;103&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/IeOM0/btsPgw9UcBV/EGO4kxtJp4LsnXWVli0lt0/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/IeOM0/btsPgw9UcBV/EGO4kxtJp4LsnXWVli0lt0/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/IeOM0/btsPgw9UcBV/EGO4kxtJp4LsnXWVli0lt0/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FIeOM0%2FbtsPgw9UcBV%2FEGO4kxtJp4LsnXWVli0lt0%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;240&quot; height=&quot;77&quot; data-origin-width=&quot;321&quot; data-origin-height=&quot;103&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;The full equation for the SFT path is as follows.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;589&quot; data-origin-height=&quot;89&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/bs3RVQ/btsPg6pndC3/3todSdhrdYVKa7tScSj6KK/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/bs3RVQ/btsPg6pndC3/3todSdhrdYVKa7tScSj6KK/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/bs3RVQ/btsPg6pndC3/3todSdhrdYVKa7tScSj6KK/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2Fbs3RVQ%2FbtsPg6pndC3%2F3todSdhrdYVKa7tScSj6KK%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;384&quot; height=&quot;58&quot; data-origin-width=&quot;589&quot; data-origin-height=&quot;89&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
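&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;The SFT path described above can be sketched in plain Python. This assumes the common form alpha * F + gamma, with the modulated result added back to the original features. The two per-element linear maps standing in for the alpha and gamma networks, and every name and weight below, are hypothetical; the exact formulation is the one in the paper's equation.&lt;/span&gt;&lt;/p&gt;

```python
def predict_affine(cond, w_alpha, b_alpha, w_gamma, b_gamma):
    """Stand-in for the two small networks: one branch yields alpha, the other gamma."""
    alpha = [w_alpha * c + b_alpha for c in cond]
    gamma = [w_gamma * c + b_gamma for c in cond]
    return alpha, gamma


def sft_modulate(feat, alpha, gamma):
    """Element-wise affine transform: scale by alpha, then shift by gamma."""
    return [a * f + g for f, a, g in zip(feat, alpha, gamma)]


def sft_residual(feat, cond, w_alpha=0.5, b_alpha=0.0, w_gamma=0.1, b_gamma=0.0):
    """Add the SFT-modulated features back onto the original features."""
    alpha, gamma = predict_affine(cond, w_alpha, b_alpha, w_gamma, b_gamma)
    modulated = sft_modulate(feat, alpha, gamma)
    return [f + m for f, m in zip(feat, modulated)]


unet_feat = [1.0, 2.0, -1.0]  # toy U-Net feature values
agg_feat = [2.0, 0.0, 4.0]    # toy aggregation-encoder features (the condition)
print(sft_residual(unet_feat, agg_feat))
```

&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;Where the condition is zero, alpha and gamma are both zero in this toy setup, so the feature passes through unchanged. In a real model the two branches would be small convolutional networks operating on feature maps.&lt;/span&gt;&lt;/p&gt;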
&lt;h3 data-ke-size=&quot;size23&quot;&gt;&lt;span style=&quot;color: #000000; font-family: 'Noto Serif KR';&quot;&gt;C. PROGRESSIVE RECONSTRUCTION SAMPLING&lt;/span&gt;&lt;/h3&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;The proposed PRS controls the sampling process of the diffusion model from a reconstruction perspective.&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #333333; text-align: start;&quot;&gt;To realize progressive reconstruction sampling, PRS splits the sampling process into an early stage and a late stage.&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;b&gt; &lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000; text-align: start;&quot;&gt; &lt;/span&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #333333; text-align: start;&quot;&gt;What controlling from a reconstruction perspective means&lt;/span&gt;&lt;/b&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;Whereas conventional methods simply remove noise at each time step,&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;this paper also removes noise but decides, according to a criterion, whether or not to add LR information.&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;The criterion is explained below.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;1455&quot; data-origin-height=&quot;639&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/HzfET/btsPgUbEkbW/W6cciJyUSfO5sIVQAnzuqk/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/HzfET/btsPgUbEkbW/W6cciJyUSfO5sIVQAnzuqk/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/HzfET/btsPgUbEkbW/W6cciJyUSfO5sIVQAnzuqk/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FHzfET%2FbtsPgUbEkbW%2FW6cciJyUSfO5sIVQAnzuqk%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;1455&quot; height=&quot;639&quot; data-origin-width=&quot;1455&quot; data-origin-height=&quot;639&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;1) EARLY SAMPLING STAGE&lt;/span&gt;&lt;/b&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;During sampling in the reverse process, LR information is injected while t is greater than the chosen threshold S.&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;Once t is no longer greater than S, the model predicts on its own, without LR information.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;391&quot; data-origin-height=&quot;56&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/coU7Xn/btsPiKmhnsV/ZTRx8rgg3pvxc5WQTo8bIK/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/coU7Xn/btsPiKmhnsV/ZTRx8rgg3pvxc5WQTo8bIK/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/coU7Xn/btsPiKmhnsV/ZTRx8rgg3pvxc5WQTo8bIK/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FcoU7Xn%2FbtsPiKmhnsV%2FZTRx8rgg3pvxc5WQTo8bIK%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;349&quot; height=&quot;50&quot; data-origin-width=&quot;391&quot; data-origin-height=&quot;56&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;648&quot; data-origin-height=&quot;86&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/byMNXK/btsPhHqwBmh/qJ74L7bdSfIXfzGKi6OJU1/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/byMNXK/btsPhHqwBmh/qJ74L7bdSfIXfzGKi6OJU1/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/byMNXK/btsPhHqwBmh/qJ74L7bdSfIXfzGKi6OJU1/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FbyMNXK%2FbtsPhHqwBmh%2FqJ74L7bdSfIXfzGKi6OJU1%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;279&quot; height=&quot;37&quot; data-origin-width=&quot;648&quot; data-origin-height=&quot;86&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;b&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt; &lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000; text-align: start;&quot;&gt; &lt;/span&gt;How LR information is added during sampling&lt;/span&gt;&lt;/b&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;A weighted sum is taken between the latent predicted by the denoising U-Net and the original LR latent.&lt;/span&gt;&lt;br /&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;The weight is adjusted according to the sampling time step, gradually producing an HR latent that is faithful to&amp;nbsp;the LR&amp;nbsp;input.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;1061&quot; data-origin-height=&quot;899&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/bGGTrw/btsPi2fVDIv/YexzhpqIKDLyhoMwGvOzmk/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/bGGTrw/btsPi2fVDIv/YexzhpqIKDLyhoMwGvOzmk/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/bGGTrw/btsPi2fVDIv/YexzhpqIKDLyhoMwGvOzmk/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FbGGTrw%2FbtsPi2fVDIv%2FYexzhpqIKDLyhoMwGvOzmk%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;419&quot; height=&quot;355&quot; data-origin-width=&quot;1061&quot; data-origin-height=&quot;899&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;2) LATE SAMPLING STAGE&lt;/span&gt;&lt;/b&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt; The stage in which the weighting coefficient is removed and the model can generate freely.&lt;/span&gt;&lt;/p&gt;
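&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;The two sampling stages can be sketched as a single blending step. This is only an illustration under stated assumptions: the convex-combination form, the linear weight schedule that vanishes at t = S, the cap w_max, and the names lr_weight and prs_blend are all hypothetical and may differ from the paper's exact equations.&lt;/span&gt;&lt;/p&gt;

```python
def lr_weight(t, S, T, w_max=0.5):
    """Hypothetical schedule: weight on the LR latent, zero once t is not greater than S."""
    if t > S:
        return w_max * (t - S) / (T - S)
    return 0.0


def prs_blend(pred_latent, lr_latent, t, S, T):
    """Early stage (t greater than S): weighted sum of the U-Net prediction and the LR latent.

    Late stage: the weight is zero, so the prediction is returned unchanged.
    """
    w = lr_weight(t, S, T)
    return [(1.0 - w) * p + w * l for p, l in zip(pred_latent, lr_latent)]


pred = [1.0, 0.0]  # toy latent predicted by the denoising U-Net
lr = [0.0, 1.0]    # toy LR latent
early = prs_blend(pred, lr, t=900, S=500, T=1000)  # LR information injected
late = prs_blend(pred, lr, t=300, S=500, T=1000)   # model output only
```

&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;With this schedule the LR weight decays smoothly to zero at t = S, so the switch from the early stage to the late stage introduces no jump in the latent.&lt;/span&gt;&lt;/p&gt;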
&lt;h3 style=&quot;text-align: justify;&quot; data-ke-size=&quot;size23&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt; D. FLOW-GUIDED LATENT CORRECTION&lt;/span&gt;&lt;/h3&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;color: #000000;&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;Previous work has shown that optical-flow-based feature propagation can effectively maintain temporal consistency.&lt;br /&gt;However, current diffusion models propagate only between adjacent frames, &lt;/span&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;so inconsistency still remains. &lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;The second-order optical flow propagation approach used in BasicVSR++ is difficult to apply to diffusion models. &lt;br /&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;⬇️ &lt;span style=&quot;font-family: 'Noto Serif KR'; text-align: justify;&quot;&gt;BasicVSR++&lt;/span&gt; arXiv paper ⬇️&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;&lt;a style=&quot;color: #000000;&quot; href=&quot;https://arxiv.org/abs/2104.13371&quot; target=&quot;_blank&quot; rel=&quot;noopener&amp;nbsp;noreferrer&quot;&gt;https://arxiv.org/abs/2104.13371&lt;/a&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;&lt;br /&gt;&lt;b&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; text-align: start;&quot;&gt;▪️&lt;/span&gt;&lt;/span&gt;&lt;/b&gt;To address this limitation, a new Flow-guided Latent Correction (FLC) module is proposed.&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt; &lt;b&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; text-align: start;&quot;&gt;▪️&lt;/span&gt;&lt;/span&gt;&lt;/b&gt;It uses optical flow to perform second-order bidirectional latent propagation between adjacent frames and cross frames within the latent space.&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;color: #000000;&quot;&gt;&lt;b&gt; &lt;span style=&quot;font-family: 'Noto Serif KR'; text-align: justify;&quot;&gt;FLC module workflow&lt;/span&gt;&lt;/b&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt; &lt;b&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; text-align: start;&quot;&gt;▪️&lt;/span&gt;&lt;/span&gt;&lt;/b&gt;A RAFT model is used to estimate the bidirectional optical flow of the LR frames. &lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt; &lt;b&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; text-align: start;&quot;&gt;▪️&lt;/span&gt;&lt;/span&gt;&lt;/b&gt;To handle occlusion and ensure reliable &lt;span style=&quot;font-family: 'Noto Serif KR'; text-align: justify;&quot;&gt;feature propagation&lt;/span&gt;, an occlusion mask (M) is also estimated for each frame through a forward-backward consistency check. The estimated flow and mask are then downsampled to match the latent space.&lt;br /&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt; &lt;b&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; text-align: start;&quot;&gt;▪️&lt;/span&gt;&lt;/span&gt;&lt;/b&gt;The latent information obtained from multiple frames is temporarily stored in a hidden state.&lt;/span&gt;&lt;/p&gt;
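&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;The forward-backward consistency check and the downsampling step can be sketched roughly as below. This is a minimal NumPy illustration, not the paper's implementation: the flows are assumed to be given as arrays (for example, RAFT outputs), and the nearest-neighbor lookup, the threshold of 1 pixel, the factor of 8, and the helper names are all assumptions made for this example.&lt;/span&gt;&lt;/p&gt;

```python
import numpy as np

def warp_nearest(field, flow):
    """Sample `field` (H, W, C) at positions displaced by `flow` (H, W, 2).

    Nearest-neighbor lookup, chosen only to keep the sketch short.
    """
    h, w = flow.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w]
    # flow[..., 0] is the horizontal (x) displacement, flow[..., 1] the vertical (y) one.
    x_src = np.clip(np.rint(xs + flow[..., 0]).astype(int), 0, w - 1)
    y_src = np.clip(np.rint(ys + flow[..., 1]).astype(int), 0, h - 1)
    return field[y_src, x_src]

def occlusion_mask(flow_fw, flow_bw, threshold=1.0):
    """Return M with 1 where forward and backward flow agree, 0 where occluded."""
    # Backward flow looked up at the location each pixel lands on.
    bw_at_target = warp_nearest(flow_bw, flow_fw)
    # For a consistent pixel, going forward and then backward returns to the start.
    round_trip_error = np.linalg.norm(flow_fw + bw_at_target, axis=-1)
    return np.less_equal(round_trip_error, threshold).astype(np.float32)

def downsample_mask(mask, factor=8):
    """Average-pool the mask by `factor` so it matches the latent resolution."""
    h, w = mask.shape
    return mask.reshape(h // factor, factor, w // factor, factor).mean(axis=(1, 3))
```

&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;The factor of 8 is chosen here only because it matches the 512 to 64 ratio between pixel space and latent space listed in the implementation details.&lt;/span&gt;&lt;/p&gt;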
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;719&quot; data-origin-height=&quot;647&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/bPEL5Y/btsPhlBl9XQ/n9kbCoOvGkz6eAVLopJycK/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/bPEL5Y/btsPhlBl9XQ/n9kbCoOvGkz6eAVLopJycK/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/bPEL5Y/btsPhlBl9XQ/n9kbCoOvGkz6eAVLopJycK/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FbPEL5Y%2FbtsPhlBl9XQ%2Fn9kbCoOvGkz6eAVLopJycK%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;435&quot; height=&quot;391&quot; data-origin-width=&quot;719&quot; data-origin-height=&quot;647&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;br /&gt;&lt;br /&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;color: #000000;&quot;&gt;&lt;b&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt; 1) FIRST-ORDER LATENT PROPAGATION&lt;/span&gt;&lt;/b&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt; This step corrects the current frame using a latent warped from the previous frame.&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;The backward optical flow from frame i-1 to frame i is computed and used to warp the latent.&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;color: #000000;&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;The warped latent is then fused with the latent of the current frame, and the parameter &amp;mu; controls how much of the previous frame is reflected.&lt;/span&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;Through an element-wise product, each position is multiplied by a value in [0, 1] given by the occlusion mask.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;536&quot; data-origin-height=&quot;107&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/bB4i5o/btsPh7JvUt6/u4kq9Gto0nwqh8nF9ovjdk/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/bB4i5o/btsPh7JvUt6/u4kq9Gto0nwqh8nF9ovjdk/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/bB4i5o/btsPh7JvUt6/u4kq9Gto0nwqh8nF9ovjdk/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FbB4i5o%2FbtsPh7JvUt6%2Fu4kq9Gto0nwqh8nF9ovjdk%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;281&quot; height=&quot;56&quot; data-origin-width=&quot;536&quot; data-origin-height=&quot;107&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
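&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;The first-order fusion step can be sketched as follows. The blending form used here is an assumption made for illustration, not a transcription of the paper's equation: the warped previous latent is mixed into the current latent with weight &amp;mu; only where the occlusion mask M allows it. The function name and array layout (H, W, C) are hypothetical; the default &amp;mu; = 0.2 follows the first-order weight reported in the implementation details.&lt;/span&gt;&lt;/p&gt;

```python
import numpy as np

def fuse_first_order(z_cur, z_prev_warped, mask, mu=0.2):
    """Blend a warped previous latent into the current latent.

    Assumed form (illustrative only):
        z_out = (1 - mu * M) * z_cur + mu * M * z_prev_warped

    z_cur, z_prev_warped: arrays of shape (H, W, C)
    mask: array of shape (H, W) with values in [0, 1]; 0 marks occluded positions
    """
    weight = mu * mask[..., None]  # broadcast the mask over the channel axis
    return (1.0 - weight) * z_cur + weight * z_prev_warped
```

&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;With this form, occluded positions (M = 0) keep the current latent unchanged, while visible positions move 20% of the way toward the warped previous latent.&lt;/span&gt;&lt;/p&gt;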
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;&lt;b&gt; 2) SECOND-ORDER LATENT PROPAGATION&lt;/b&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000; text-align: justify;&quot;&gt; The backward optical flow from frame i-2 to frame i-1 is computed and used to warp the latent. &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;600&quot; data-origin-height=&quot;114&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/bytB5s/btsPiszRNuu/4CplUfB7pdsq16s607TNck/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/bytB5s/btsPiszRNuu/4CplUfB7pdsq16s607TNck/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/bytB5s/btsPiszRNuu/4CplUfB7pdsq16s607TNck/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FbytB5s%2FbtsPiszRNuu%2F4CplUfB7pdsq16s607TNck%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;279&quot; height=&quot;53&quot; data-origin-width=&quot;600&quot; data-origin-height=&quot;114&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;580&quot; data-origin-height=&quot;51&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/baP3Y8/btsPirHNMn2/7izPpRaTFQWDQsR8on8hD0/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/baP3Y8/btsPirHNMn2/7izPpRaTFQWDQsR8on8hD0/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/baP3Y8/btsPirHNMn2/7izPpRaTFQWDQsR8on8hD0/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FbaP3Y8%2FbtsPirHNMn2%2F7izPpRaTFQWDQsR8on8hD0%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;296&quot; height=&quot;26&quot; data-origin-width=&quot;580&quot; data-origin-height=&quot;51&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;h3 style=&quot;text-align: justify;&quot; data-ke-size=&quot;size23&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt; E. TRAINING STRATEGY&lt;/span&gt;&lt;/h3&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;1D temporal convolutions are integrated into the pre-trained Stable Diffusion model to improve temporal modeling, and the denoising U-Net and the VAE decoder are fine-tuned. DiffVSR is trained in two stages.&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;color: #000000;&quot;&gt;&lt;b&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt; Stage 1 training&lt;/span&gt;&lt;/b&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;The denoising U-Net and the IFAG module are trained.&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;The weights of the denoising U-Net are initialized from Stable Diffusion V2.1 (512-base-ema).&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt; During training, all pre-trained U-Net parameters are frozen, and only the parameters of the &lt;span style=&quot;font-family: 'Noto Serif KR'; text-align: start;&quot;&gt;1D temporal convolution&lt;/span&gt; layers are trained.&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;color: #000000;&quot;&gt;&lt;b&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt; &lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt; &lt;/span&gt;diffusion loss&lt;/span&gt;&lt;/b&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;556&quot; data-origin-height=&quot;54&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/bk5JC2/btsPg8WBy4I/KqnkprnTJ5Skv2eUp4e7C1/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/bk5JC2/btsPg8WBy4I/KqnkprnTJ5Skv2eUp4e7C1/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/bk5JC2/btsPg8WBy4I/KqnkprnTJ5Skv2eUp4e7C1/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2Fbk5JC2%2FbtsPg8WBy4I%2FKqnkprnTJ5Skv2eUp4e7C1%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;319&quot; height=&quot;31&quot; data-origin-width=&quot;556&quot; data-origin-height=&quot;54&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt; &lt;b&gt; Stage 2 training&lt;/b&gt; &lt;/span&gt;&lt;/p&gt;
&lt;p data-pm-slice=&quot;0 0 []&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;The VAE decoder is fine-tuned to improve the accuracy of reconstructing video from latent space back to pixel space. &lt;/span&gt;&lt;/p&gt;
&lt;p data-pm-slice=&quot;0 0 []&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;The &lt;span style=&quot;font-family: 'Noto Serif KR'; text-align: start;&quot;&gt;denoising &lt;/span&gt;U-Net and IFAG module trained in stage 1 are used to generate latent sequences.&lt;/span&gt;&lt;/p&gt;
&lt;p data-pm-slice=&quot;0 0 []&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;The VAE decoder is then trained using each &lt;span style=&quot;font-family: 'Noto Serif KR'; text-align: start;&quot;&gt;latent sequence&lt;/span&gt; and the corresponding LR frames.&lt;/span&gt;&lt;/p&gt;
&lt;p data-pm-slice=&quot;0 0 []&quot; data-ke-size=&quot;size14&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-pm-slice=&quot;0 0 []&quot; data-ke-size=&quot;size14&quot;&gt;&lt;b&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt; &lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt; &lt;/span&gt;&lt;/span&gt;Total loss = combination of reconstruction, perceptual, and GAN losses&lt;/span&gt;&lt;/b&gt;&lt;/p&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;660&quot; data-origin-height=&quot;196&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/ba8oAK/btsPhHxvpQ4/cPPlxJw0Utnf6CtHIH8WL1/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/ba8oAK/btsPhHxvpQ4/cPPlxJw0Utnf6CtHIH8WL1/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/ba8oAK/btsPhHxvpQ4/cPPlxJw0Utnf6CtHIH8WL1/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2Fba8oAK%2FbtsPhHxvpQ4%2FcPPlxJw0Utnf6CtHIH8WL1%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;354&quot; height=&quot;105&quot; data-origin-width=&quot;660&quot; data-origin-height=&quot;196&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;hr contenteditable=&quot;false&quot; data-ke-type=&quot;horizontalRule&quot; data-ke-style=&quot;style6&quot; /&gt;
&lt;h2 style=&quot;text-align: justify;&quot; data-ke-size=&quot;size26&quot;&gt;&lt;span style=&quot;color: #000000; font-family: 'Noto Serif KR';&quot;&gt;IV. EXPERIMENTS&lt;/span&gt;&lt;/h2&gt;
&lt;h3 style=&quot;text-align: justify;&quot; data-ke-size=&quot;size23&quot;&gt;&lt;span style=&quot;color: #000000; font-family: 'Noto Serif KR';&quot;&gt;A. EXPERIMENTAL SETTINGS&lt;/span&gt;&lt;/h3&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;color: #000000;&quot;&gt;&lt;b&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;1) IMPLEMENTATION DETAILS&lt;/span&gt;&lt;/b&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;color: #000000; font-family: 'Noto Serif KR';&quot;&gt;Framework: PyTorch&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;color: #000000; font-family: 'Noto Serif KR';&quot;&gt;GPU: 4 &amp;times; NVIDIA RTX 3090 (used in parallel)&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;color: #000000; font-family: 'Noto Serif KR';&quot;&gt;batch size: 4&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;color: #000000; font-family: 'Noto Serif KR';&quot;&gt;Number of input LR frames: 6&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;color: #000000; font-family: 'Noto Serif KR';&quot;&gt;LR image resolution: 512 &amp;times; 512&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;color: #000000;&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;Latent space size: 64 &amp;times; 64 &lt;/span&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;&amp;rarr; the result of VAE encoding&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;color: #000000; font-family: 'Noto Serif KR';&quot;&gt;Optimizer: Adam&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;color: #000000; font-family: 'Noto Serif KR';&quot;&gt;learning rate: 5.0 &amp;times; 10⁻⁵&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;color: #000000; font-family: 'Noto Serif KR';&quot;&gt;Noise Scheduler for Diffusion&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;color: #000000; font-family: 'Noto Serif KR';&quot;&gt;Type: linear scheduler&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;color: #000000; font-family: 'Noto Serif KR';&quot;&gt;Start value: &amp;beta;₀ = 0.00085&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;color: #000000; font-family: 'Noto Serif KR';&quot;&gt;End value: &amp;beta;&lt;sub&gt;T&lt;/sub&gt; = 0.012&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;color: #000000; font-family: 'Noto Serif KR';&quot;&gt;Number of timesteps (T): 1000&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;color: #000000; font-family: 'Noto Serif KR';&quot;&gt;IFAG (Inter-Frame Aggregation Guidance)&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;color: #000000;&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;Number of neighboring frames &lt;/span&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt; &lt;/span&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;=&lt;/span&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;2 &lt;/span&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;&amp;rarr; that is, 2 frames before and 2 frames after the current frame, 5 frames in total&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;color: #000000;&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;Frame weight &lt;/span&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt; &lt;/span&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;=&lt;/span&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;0.3 &lt;/span&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;&amp;rarr; the weight applied in SFT modulation&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;color: #000000; font-family: 'Noto Serif KR';&quot;&gt;Sampling steps: 50 (number of diffusion iterations)&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;color: #000000; font-family: 'Noto Serif KR';&quot;&gt;&amp;tau; = 4: controls the time-dependent weight in PRS&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;color: #000000; font-family: 'Noto Serif KR';&quot;&gt;S = 25: the midpoint of the sampling steps where the process switches (early &amp;rarr; late)&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;color: #000000; font-family: 'Noto Serif KR';&quot;&gt;&amp;mu;₁ = 0.2: weight of the first-order warped latent&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;color: #000000; font-family: 'Noto Serif KR';&quot;&gt;&amp;mu;₂ = 0.1: weight of the second-order warped latent&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;color: #000000; font-family: 'Noto Serif KR';&quot;&gt;Handling arbitrary resolutions: progressive patch aggregation sampling&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;color: #000000; font-family: 'Noto Serif KR';&quot;&gt;Inspired by StableSR&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;color: #000000; font-family: 'Noto Serif KR';&quot;&gt;Makes it possible to sample videos of various resolutions&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;color: #000000; font-family: 'Noto Serif KR';&quot;&gt;Example: a large video is split into patches, each patch is sampled separately, and the results are reassembled&lt;/span&gt;&lt;/p&gt;
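&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;color: #000000; font-family: 'Noto Serif KR';&quot;&gt;As a rough illustration of the noise scheduler settings listed above, the sketch below builds a plain linear &amp;beta; schedule with the reported endpoints and T = 1000, then draws a noisy latent from the standard forward process. The helper name add_noise and the array shapes are hypothetical. Note also that Stable Diffusion implementations often interpolate in &amp;radic;&amp;beta; space rather than in &amp;beta; directly; which variant the paper uses is not stated here, so the plain linear form is an assumption.&lt;/span&gt;&lt;/p&gt;

```python
import numpy as np

T = 1000
beta_start, beta_end = 0.00085, 0.012

# Plain linear schedule between the two reported endpoints (assumed variant).
betas = np.linspace(beta_start, beta_end, T)
# Cumulative product of (1 - beta_t), usually written alpha-bar_t.
alphas_cumprod = np.cumprod(1.0 - betas)

def add_noise(z0, t, rng):
    """Sample z_t from q(z_t | z_0) for an integer timestep t in [0, T - 1].

    Returns the noisy latent and the Gaussian noise that was added.
    """
    noise = rng.standard_normal(z0.shape)
    a_bar = alphas_cumprod[t]
    z_t = np.sqrt(a_bar) * z0 + np.sqrt(1.0 - a_bar) * noise
    return z_t, noise

# Usage: noise a 64 x 64 latent with 4 channels at the midpoint of the schedule.
rng = np.random.default_rng(0)
z_t, eps = add_noise(np.zeros((4, 64, 64)), t=500, rng=rng)
```

&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;color: #000000; font-family: 'Noto Serif KR';&quot;&gt;A denoising network trained with a diffusion loss would typically be asked to predict the returned noise from z_t and t.&lt;/span&gt;&lt;/p&gt;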
&lt;p data-ke-size=&quot;size14&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;color: #000000;&quot;&gt; &lt;b&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;2) DATASETS&lt;/span&gt;&lt;/b&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p data-pm-slice=&quot;0 0 []&quot; data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt; &lt;b&gt; &lt;/b&gt;REDS&lt;/span&gt;&lt;/p&gt;
&lt;p data-pm-slice=&quot;0 0 []&quot; data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;240 training clips, 30 validation clips, and 30 test clips&lt;/span&gt;&lt;/p&gt;
&lt;p data-pm-slice=&quot;0 0 []&quot; data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;Each clip consists of 100 frames at a resolution of 1280 &amp;times; 720. &lt;/span&gt;&lt;/p&gt;
&lt;p data-pm-slice=&quot;0 0 []&quot; data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;266 clips are used for training (the 240 + 30 = 270 training and validation clips minus the 4 REDS4 clips), and the 4 clips known as REDS4 (000, 011, 015, 020) are used for validation and testing.&lt;/span&gt;&lt;/p&gt;
&lt;p data-pm-slice=&quot;0 0 []&quot; data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;&lt;b&gt; &lt;/b&gt;Vid4&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;Four video clips of varying length and resolution&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;Each clip has roughly 40 frames at a resolution of about 720 &amp;times; 480. &lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;For Vid4, all four clips (calendar, city, foliage, walk) are used as the test set for evaluating models trained on the Vimeo-90K dataset.&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;color: #000000;&quot;&gt;&lt;b&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt; 3) EVALUATION METRICS&lt;/span&gt;&lt;/b&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt; &lt;b&gt; Full-Reference Metrics &lt;/b&gt; &lt;/span&gt;&lt;/p&gt;
&lt;table style=&quot;border-collapse: collapse; width: 100%;&quot; border=&quot;1&quot; data-ke-align=&quot;alignLeft&quot;&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;span style=&quot;color: #000000; font-family: 'Noto Serif KR';&quot;&gt;LPIPS (Learned Perceptual Image Patch Similarity)&lt;/span&gt;&lt;/td&gt;
&lt;td&gt;&lt;span style=&quot;color: #000000; font-family: 'Noto Serif KR';&quot;&gt;Perceptual similarity&lt;/span&gt;&lt;/td&gt;
&lt;td&gt;&lt;span style=&quot;color: #000000; font-family: 'Noto Serif KR';&quot;&gt;Measures similarity in the feature space of a VGG network. Close to human perception. Lower is better.&lt;/span&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;span style=&quot;color: #000000; font-family: 'Noto Serif KR';&quot;&gt;DISTS (Deep Image Structure and Texture Similarity)&lt;/span&gt;&lt;/td&gt;
&lt;td&gt;&lt;span style=&quot;color: #000000; font-family: 'Noto Serif KR';&quot;&gt;Structure and texture&lt;/span&gt;&lt;/td&gt;
&lt;td&gt;&lt;span style=&quot;color: #000000; font-family: 'Noto Serif KR';&quot;&gt;Reflects both structural consistency and texture similarity. Puts more emphasis on texture than LPIPS. Lower is better.&lt;/span&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;span style=&quot;color: #000000; font-family: 'Noto Serif KR';&quot;&gt;PSNR (Peak Signal-to-Noise Ratio)&lt;/span&gt;&lt;/td&gt;
&lt;td&gt;&lt;span style=&quot;color: #000000; font-family: 'Noto Serif KR';&quot;&gt;Pixel accuracy&lt;/span&gt;&lt;/td&gt;
&lt;td&gt;&lt;span style=&quot;color: #000000; font-family: 'Noto Serif KR';&quot;&gt;Measures the pixel-wise difference from the GT. Higher is better.&lt;/span&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;span style=&quot;color: #000000; font-family: 'Noto Serif KR';&quot;&gt;SSIM (Structural Similarity Index)&lt;/span&gt;&lt;/td&gt;
&lt;td&gt;&lt;span style=&quot;color: #000000; font-family: 'Noto Serif KR';&quot;&gt;Structural similarity&lt;/span&gt;&lt;/td&gt;
&lt;td&gt;&lt;span style=&quot;color: #000000; font-family: 'Noto Serif KR';&quot;&gt;Compares local luminance, contrast, and structure. &lt;/span&gt;&lt;span style=&quot;color: #000000; font-family: 'Noto Serif KR';&quot;&gt;Higher means closer to the GT.&lt;/span&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
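&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;color: #000000; font-family: 'Noto Serif KR';&quot;&gt;Of the full-reference metrics above, PSNR is simple enough to compute by hand. The sketch below uses the usual definition, 10 &amp;middot; log10(MAX&amp;sup2; / MSE), and assumes 8-bit images with a peak value of 255. Evaluation details that papers sometimes vary (for example, computing on the Y channel only or cropping borders) are not covered, and the function name is hypothetical.&lt;/span&gt;&lt;/p&gt;

```python
import numpy as np

def psnr(reference, restored, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two same-shaped images."""
    ref = np.asarray(reference, dtype=np.float64)
    out = np.asarray(restored, dtype=np.float64)
    mse = np.mean((ref - out) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10(max_value ** 2 / mse)
```

&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;color: #000000; font-family: 'Noto Serif KR';&quot;&gt;For video, a common convention is to average the per-frame PSNR over all frames of a clip.&lt;/span&gt;&lt;/p&gt;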
&lt;p data-ke-size=&quot;size14&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;color: #000000;&quot;&gt;&lt;b&gt;  No-Reference Metrics&lt;/b&gt;&lt;/span&gt;&lt;/p&gt;
&lt;table style=&quot;border-collapse: collapse; width: 100%;&quot; border=&quot;1&quot; data-ke-align=&quot;alignLeft&quot;&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;span style=&quot;color: #000000; font-family: GungSeo, serif;&quot;&gt;NIQE (Natural Image Quality Evaluator)&lt;/span&gt;&lt;/td&gt;
&lt;td&gt;&lt;span style=&quot;color: #000000; font-family: GungSeo, serif;&quot;&gt;Comparison against the statistical distribution of natural images&lt;/span&gt;&lt;/td&gt;
&lt;td&gt;&lt;span style=&quot;color: #000000; font-family: GungSeo, serif;&quot;&gt;Lower means more natural.&lt;/span&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;span style=&quot;color: #000000; font-family: GungSeo, serif;&quot;&gt;BRISQUE (Blind/Referenceless Image Spatial Quality Evaluator)&lt;/span&gt;&lt;/td&gt;
&lt;td&gt;&lt;span style=&quot;color: #000000; font-family: GungSeo, serif;&quot;&gt;Spatial-domain features based on natural scene statistics&lt;/span&gt;&lt;/td&gt;
&lt;td&gt;&lt;span style=&quot;color: #000000; font-family: GungSeo, serif;&quot;&gt;Similar to NIQE. Lower is better.&lt;/span&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;span style=&quot;color: #000000; font-family: GungSeo, serif;&quot;&gt;CLIP-IQA&lt;/span&gt;&lt;/td&gt;
&lt;td&gt;&lt;span style=&quot;color: #000000; font-family: GungSeo, serif;&quot;&gt;Measures image &amp;harr; prompt similarity with a CLIP model&lt;/span&gt;&lt;/td&gt;
&lt;td&gt;&lt;span style=&quot;color: #000000; font-family: GungSeo, serif;&quot;&gt;Compares embeddings of the generated frames with prompts such as &quot;Good image&quot; and &quot;Sharp image&quot;. Higher is better.&lt;/span&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;h3 data-ke-size=&quot;size23&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;B. COMPARISONS WITH EXISTING METHODS&lt;/span&gt;&lt;/h3&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt; &lt;b&gt; Conventional VSR methods&lt;/b&gt;: BasicVSR, VRT, BasicVSR++&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt; &lt;b&gt; Diffusion-based VSR methods&lt;/b&gt;: StableSR, MGLD, SeeClear, SATeCo&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;888&quot; data-origin-height=&quot;580&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/eFQmun/btsPiyGCSHh/EZXz15dIYEM1KR8yRduolk/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/eFQmun/btsPiyGCSHh/EZXz15dIYEM1KR8yRduolk/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/eFQmun/btsPiyGCSHh/EZXz15dIYEM1KR8yRduolk/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FeFQmun%2FbtsPiyGCSHh%2FEZXz15dIYEM1KR8yRduolk%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;705&quot; height=&quot;460&quot; data-origin-width=&quot;888&quot; data-origin-height=&quot;580&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;1324&quot; data-origin-height=&quot;447&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/bJoRd8/btsPiCvq7hI/pMJWy5peftUzp8x6zpntNK/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/bJoRd8/btsPiCvq7hI/pMJWy5peftUzp8x6zpntNK/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/bJoRd8/btsPiCvq7hI/pMJWy5peftUzp8x6zpntNK/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FbJoRd8%2FbtsPiCvq7hI%2FpMJWy5peftUzp8x6zpntNK%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;661&quot; height=&quot;223&quot; data-origin-width=&quot;1324&quot; data-origin-height=&quot;447&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p&gt;&lt;figure class=&quot;imagegridblock&quot;&gt;
  &lt;div class=&quot;image-container&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/bbOWyZ/btsPipC90OE/4lKKV97kkLBPlPMPZaQCX0/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/bbOWyZ/btsPipC90OE/4lKKV97kkLBPlPMPZaQCX0/img.png&quot; width=&quot;355&quot; height=&quot;324&quot; data-origin-width=&quot;663&quot; data-origin-height=&quot;605&quot; data-is-animation=&quot;false&quot; style=&quot;width: 50.6419%; margin-right: 10px;&quot; data-widthpercent=&quot;51.24&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/bbOWyZ/btsPipC90OE/4lKKV97kkLBPlPMPZaQCX0/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FbbOWyZ%2FbtsPipC90OE%2F4lKKV97kkLBPlPMPZaQCX0%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;663&quot; height=&quot;605&quot;/&gt;&lt;/span&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/r4n5g/btsPhwbIgYW/J3EYJpekmiPSTCj0Pkvpjk/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/r4n5g/btsPhwbIgYW/J3EYJpekmiPSTCj0Pkvpjk/img.png&quot; data-origin-width=&quot;656&quot; data-origin-height=&quot;629&quot; data-is-animation=&quot;false&quot; width=&quot;351&quot; height=&quot;337&quot; style=&quot;width: 48.1953%;&quot; data-widthpercent=&quot;48.76&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/r4n5g/btsPhwbIgYW/J3EYJpekmiPSTCj0Pkvpjk/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2Fr4n5g%2FbtsPhwbIgYW%2FJ3EYJpekmiPSTCj0Pkvpjk%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; 
width=&quot;656&quot; height=&quot;629&quot;/&gt;&lt;/span&gt;&lt;/div&gt;
&lt;/figure&gt;
&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;br /&gt;&lt;br /&gt;2) QUALITATIVE COMPARISON&lt;/p&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;1282&quot; data-origin-height=&quot;753&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/BqpP2/btsPhrO9ikq/8AsMvk0v1a2pqRevkaUjRk/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/BqpP2/btsPhrO9ikq/8AsMvk0v1a2pqRevkaUjRk/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/BqpP2/btsPhrO9ikq/8AsMvk0v1a2pqRevkaUjRk/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FBqpP2%2FbtsPhrO9ikq%2F8AsMvk0v1a2pqRevkaUjRk%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;1282&quot; height=&quot;753&quot; data-origin-width=&quot;1282&quot; data-origin-height=&quot;753&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;1282&quot; data-origin-height=&quot;756&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/bABsZd/btsPhFzGJmx/kyyeduXBjx3fm4daKljzG0/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/bABsZd/btsPhFzGJmx/kyyeduXBjx3fm4daKljzG0/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/bABsZd/btsPhFzGJmx/kyyeduXBjx3fm4daKljzG0/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FbABsZd%2FbtsPhFzGJmx%2FkyyeduXBjx3fm4daKljzG0%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;1282&quot; height=&quot;756&quot; data-origin-width=&quot;1282&quot; data-origin-height=&quot;756&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;1152&quot; data-origin-height=&quot;812&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/dWfvuj/btsPiwB3yYk/zkIvlCKL3vN2qox26soiB1/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/dWfvuj/btsPiwB3yYk/zkIvlCKL3vN2qox26soiB1/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/dWfvuj/btsPiwB3yYk/zkIvlCKL3vN2qox26soiB1/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FdWfvuj%2FbtsPiwB3yYk%2FzkIvlCKL3vN2qox26soiB1%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;1152&quot; height=&quot;812&quot; data-origin-width=&quot;1152&quot; data-origin-height=&quot;812&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;1170&quot; data-origin-height=&quot;747&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/dPiUPS/btsPg56FRhS/eV3DoYlxwHmaJFfHq9PnS1/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/dPiUPS/btsPg56FRhS/eV3DoYlxwHmaJFfHq9PnS1/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/dPiUPS/btsPg56FRhS/eV3DoYlxwHmaJFfHq9PnS1/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FdPiUPS%2FbtsPg56FRhS%2FeV3DoYlxwHmaJFfHq9PnS1%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;1170&quot; height=&quot;747&quot; data-origin-width=&quot;1170&quot; data-origin-height=&quot;747&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;figure class=&quot;imagegridblock&quot;&gt;
  &lt;div class=&quot;image-container&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/l7sWv/btsPjl0CqI9/RkiYLZhLJlq3PuTXiXSP51/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/l7sWv/btsPjl0CqI9/RkiYLZhLJlq3PuTXiXSP51/img.png&quot; data-origin-width=&quot;650&quot; data-origin-height=&quot;940&quot; data-is-animation=&quot;false&quot; width=&quot;345&quot; height=&quot;499&quot; style=&quot;width: 53.5677%; margin-right: 10px;&quot; data-widthpercent=&quot;54.2&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/l7sWv/btsPjl0CqI9/RkiYLZhLJlq3PuTXiXSP51/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2Fl7sWv%2FbtsPjl0CqI9%2FRkiYLZhLJlq3PuTXiXSP51%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;650&quot; height=&quot;940&quot;/&gt;&lt;/span&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/ecCeQi/btsPitFxNui/0BKmfQ0EVRShVN8yhv3Jbk/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/ecCeQi/btsPitFxNui/0BKmfQ0EVRShVN8yhv3Jbk/img.png&quot; data-origin-width=&quot;658&quot; data-origin-height=&quot;1126&quot; data-is-animation=&quot;false&quot; style=&quot;width: 45.2695%;&quot; data-widthpercent=&quot;45.8&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/ecCeQi/btsPitFxNui/0BKmfQ0EVRShVN8yhv3Jbk/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FecCeQi%2FbtsPitFxNui%2F0BKmfQ0EVRShVN8yhv3Jbk%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;658&quot; 
height=&quot;1126&quot;/&gt;&lt;/span&gt;&lt;/div&gt;
&lt;/figure&gt;
&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;</description>
      <category>Paper Review/Video Super-Resolution</category>
      <category>Diffusion</category>
      <category>Paper review</category>
      <category>VSR</category>
      <category>vsrdiff</category>
      <author>honey-vision</author>
      <guid isPermaLink="true">https://honey-vision.tistory.com/122</guid>
      <comments>https://honey-vision.tistory.com/122#entry122comment</comments>
      <pubDate>Sun, 13 Jul 2025 17:28:49 +0900</pubDate>
    </item>
    <item>
      <title>[Paper Review] ReactFace: Online Multiple Appropriate Facial Reaction Generation in Dyadic Interactions</title>
      <link>https://honey-vision.tistory.com/116</link>
      <description>&lt;h2 style=&quot;text-align: justify;&quot; data-ke-size=&quot;size26&quot;&gt;&lt;span style=&quot;color: #000000;&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;Abstract&lt;/span&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;&lt;/span&gt;&lt;/span&gt;&lt;/h2&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;color: #000000;&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;Predicting a listener's facial reactions in a dyadic interaction is a hard problem because reactions vary from person to person. &lt;/span&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;color: #000000;&quot;&gt;&lt;u&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;Previous approaches&lt;/span&gt;&lt;/u&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;They treated the task as an interpolation or fitting problem.&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;color: #000000;&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;They emphasized deterministic outcomes while ignoring the diversity and uncertainty of facial reactions.&lt;/span&gt;&lt;u&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;&lt;/span&gt;&lt;/u&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt; &lt;span style=&quot;font-family: 'Noto Serif KR'; color: #333333; text-align: start;&quot;&gt; &lt;/span&gt;Fitting: assumes that a given input has exactly one correct output. &lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;The model is trained to minimize the error between the expression the person actually made and the expression the model generates.&lt;/span&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt; &lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;As a result, the model produces either the average of the reactions in the training data or the single most probable one.&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt; &lt;span style=&quot;font-family: 'Noto Serif KR'; color: #333333; text-align: start;&quot;&gt; &lt;/span&gt;Interpolation: fills in intermediate values between the reactions seen in the training data.&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;It belongs to the family of approaches that predict a single, fixed, most plausible reaction for a given situation.&lt;/span&gt;&lt;/p&gt;
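&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;The averaging effect of fitting can be seen in a toy sketch. It is not taken from the paper: the scalar encoding of reactions (+1.0 for a smile, -1.0 for a frown) and the names valid_reactions and mse are assumptions made purely for illustration. When one speaker input has two equally valid listener reactions, the prediction that minimizes the mean squared error is their average, which matches neither of them.&lt;/span&gt;&lt;/p&gt;

```python
# Toy illustration with hypothetical numbers: one speaker input and two
# equally valid listener reactions, encoded as scalars.
valid_reactions = [1.0, -1.0]  # +1.0 = smile, -1.0 = frown (assumed encoding)


def mse(prediction, targets):
    """Mean squared error of one prediction against every valid target."""
    return sum((prediction - t) ** 2 for t in targets) / len(targets)


# Search a grid of candidate predictions for the one with the lowest error.
candidates = [i / 10 for i in range(-10, 11)]
best_prediction = min(candidates, key=lambda p: mse(p, valid_reactions))

print(best_prediction)                     # the average of the two reactions
print(best_prediction in valid_reactions)  # it is not itself a valid reaction
```

&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;A model that learns a distribution over reactions could instead sample either valid reaction, which is the motivation for the one-to-many view discussed in this review.&lt;/span&gt;&lt;/p&gt;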
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;color: #000000;&quot;&gt;&lt;u&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;Problems&lt;/span&gt;&lt;/u&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;It is difficult to model short-range and long-range dependencies within the context of the interaction.&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;This leads to problems with the synchrony and appropriateness of the generated facial reactions.&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;color: #000000;&quot;&gt;&lt;u&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;Method proposed in this paper&lt;/span&gt;&lt;/u&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt; &lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000; text-align: justify;&quot;&gt;1)&lt;/span&gt;&amp;nbsp;Reformulates the task as an extrapolation or prediction problem&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000; text-align: justify;&quot;&gt;2)&lt;/span&gt; Proposes ReactFace, which generates multiple appropriate facial reactions from the speaker's behaviour rather than simply replicating the listener's facial behaviour&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;color: #000000;&quot;&gt;&lt;u&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;ReactFace&lt;/span&gt;&lt;/u&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt; &lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000; text-align: justify;&quot;&gt;1) &lt;/span&gt;Learns a distribution of appropriate facial reactions&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt; &lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000; text-align: justify;&quot;&gt;2)&lt;/span&gt; Synchronizes the generated facial reactions with the speaker's verbal and non-verbal behaviour at each timestamp to produce realistic 2D facial reaction sequences&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;The code is available at the GitHub link below.&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt; &lt;a style=&quot;color: #000000;&quot; href=&quot;https://github.com/lingjivoo/ReactFace&quot;&gt;https://github.com/lingjivoo/ReactFace&lt;/a&gt;&lt;/span&gt;&lt;/p&gt;
&lt;hr contenteditable=&quot;false&quot; data-ke-type=&quot;horizontalRule&quot; data-ke-style=&quot;style6&quot; /&gt;
&lt;h2 style=&quot;text-align: justify;&quot; data-ke-size=&quot;size26&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;1&amp;nbsp;INTRODUCTION&lt;/span&gt;&lt;/h2&gt;
&lt;blockquote data-end=&quot;224&quot; data-start=&quot;27&quot; data-ke-style=&quot;style2&quot;&gt;&lt;b&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;Why studying empathy is difficult&lt;/span&gt;&lt;/b&gt;&lt;/blockquote&gt;
&lt;p style=&quot;text-align: justify;&quot; data-end=&quot;224&quot; data-start=&quot;27&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;- Because empathy is influenced by the other person's multimodal cues and by contextual factors, &lt;u&gt;different reactions can arise even toward the same speaker, or from the same listener. &lt;/u&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-end=&quot;224&quot; data-start=&quot;27&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;- Facial reaction generation should be treated not as &lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000; text-align: justify;&quot;&gt;a one-to-one mapping with a deterministic outcome but as a &lt;/span&gt;&lt;u&gt;one-to-many mapping&lt;/u&gt;.&amp;nbsp;&amp;nbsp;&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-end=&quot;224&quot; data-start=&quot;27&quot; data-ke-size=&quot;size14&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-end=&quot;224&quot; data-start=&quot;27&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;&lt;u&gt;Previous work: &lt;span style=&quot;font-family: 'Noto Serif KR'; text-align: justify;&quot;&gt;online&amp;nbsp;facial&amp;nbsp;reaction&amp;nbsp;generation&lt;/span&gt;&lt;/u&gt; &lt;br /&gt;&lt;/span&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;&lt;span style=&quot;color: #000000;&quot;&gt;Using autoregressive or segment-by-segment generation, &lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000; text-align: justify;&quot;&gt;these methods produce the listener's facial reaction sequence in real time; &lt;/span&gt;&lt;/span&gt;&lt;b&gt;&lt;span style=&quot;color: #1a5490;&quot;&gt;they are designed to generate it on the fly.&lt;/span&gt;&lt;/b&gt; &lt;span style=&quot;color: #000000;&quot;&gt;They mainly relied on CGANs, using the speaker's information as the conditioning signal and focusing on reproducing the listener's facial reaction frames.&lt;/span&gt; &lt;br /&gt;&lt;span style=&quot;color: #1a5490;&quot;&gt;&lt;b&gt;&lt;u&gt;Problems with previous work:&lt;/u&gt;&lt;/b&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style=&quot;color: #000000;&quot;&gt;- They did not account for how the speaker's facial expression changes over time.&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-end=&quot;224&quot; data-start=&quot;27&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;&lt;span style=&quot;color: #000000;&quot;&gt;- They did not incorporate the speaker's other modalities. &lt;/span&gt;&lt;br /&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-end=&quot;224&quot; data-start=&quot;27&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;&lt;span style=&quot;color: #000000;&quot;&gt;&lt;u&gt;Methods proposed to address these problems:&lt;/u&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style=&quot;color: #000000;&quot;&gt;- Introducing temporal networks such as LSTM (Long Short-Term Memory)&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-end=&quot;224&quot; data-start=&quot;27&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;&lt;span style=&quot;color: #000000;&quot;&gt;- Combining the input with information extracted from the speaker's audio or text&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-end=&quot;224&quot; data-start=&quot;27&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;color: #781b33;&quot;&gt;&lt;b&gt;&lt;u&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;Remaining unsolved problem 1:&lt;/span&gt;&lt;/u&gt;&lt;/b&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-end=&quot;224&quot; data-start=&quot;27&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;These methods still fall short of capturing the synchrony between the speaker's behaviour and the listener's facial reactions. &lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-end=&quot;224&quot; data-start=&quot;27&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;color: #781b33;&quot;&gt; &lt;b&gt;&lt;u&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;Remaining unsolved problem 2:&lt;/span&gt;&lt;/u&gt;&lt;/b&gt; &lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-end=&quot;224&quot; data-start=&quot;27&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;color: #000000;&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;Most existing studies share a training strategy that directly pairs {speaker, listener} samples. &lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-end=&quot;224&quot; data-start=&quot;27&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;color: #000000;&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; text-align: justify;&quot;&gt;- The same speaker input can be paired with different listener facial reaction labels.&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-end=&quot;224&quot; data-start=&quot;27&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;color: #000000;&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; text-align: justify;&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; text-align: justify;&quot;&gt;- They do not consider whether the generated facial reaction is appropriate in the given conversational context.&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-end=&quot;224&quot; data-start=&quot;27&quot; data-ke-size=&quot;size14&quot;&gt;&lt;b&gt;&lt;span style=&quot;color: #1a5490;&quot;&gt; &lt;u&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; text-align: justify;&quot;&gt;Method proposed in this paper:&lt;/span&gt;&lt;/u&gt; &lt;/span&gt;&lt;/b&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-end=&quot;576&quot; data-start=&quot;480&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000; text-align: justify;&quot;&gt;- &lt;span style=&quot;font-family: 'Noto Serif KR'; color: #333333; text-align: justify;&quot;&gt;&lt;span style=&quot;color: #000000;&quot;&gt;Proposes a speaker-listener BS strategy&lt;/span&gt;&lt;/span&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #333333; text-align: justify;&quot;&gt;&amp;nbsp;&lt;/span&gt; &lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-end=&quot;576&quot; data-start=&quot;480&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000; text-align: justify;&quot;&gt;- Proposes an AFRG mechanism for one-to-many mapping&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-end=&quot;576&quot; data-start=&quot;480&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000; text-align: justify;&quot;&gt;- This makes it possible to sample diverse facial reactions from the learned distribution that differ from one another yet remain appropriate. &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;780&quot; data-origin-height=&quot;803&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/ckmTtF/btsO0VWEz6v/G4JC9bKMFmg9CVNGqbg72K/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/ckmTtF/btsO0VWEz6v/G4JC9bKMFmg9CVNGqbg72K/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/ckmTtF/btsO0VWEz6v/G4JC9bKMFmg9CVNGqbg72K/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FckmTtF%2FbtsO0VWEz6v%2FG4JC9bKMFmg9CVNGqbg72K%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;528&quot; height=&quot;544&quot; data-origin-width=&quot;780&quot; data-origin-height=&quot;803&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;hr contenteditable=&quot;false&quot; data-ke-type=&quot;horizontalRule&quot; data-ke-style=&quot;style6&quot; /&gt;
&lt;h2 style=&quot;text-align: justify;&quot; data-ke-size=&quot;size26&quot;&gt;&lt;span style=&quot;color: #000000;&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;2 RELATED WORK&lt;/span&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;&lt;/span&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;&lt;/span&gt;&lt;/span&gt;&lt;/h2&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;2.1 Review of existing approaches to facial reaction generation and facial behaviour generation&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;2.2 Review of research on other non-verbal behaviours (gesture and body motion generation)&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;2.3 Conditional generative models&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;2.4 Review of techniques used for modality alignment&lt;/span&gt;&lt;/p&gt;
&lt;h3 style=&quot;text-align: justify;&quot; data-ke-size=&quot;size23&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;2.1 Automatic Facial Reaction Generation&lt;/span&gt;&lt;br /&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;&lt;/span&gt;&lt;/h3&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;u&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;Existing approaches:&lt;/span&gt;&lt;/u&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;Prior work has centred on taking the speaker's behaviour as input and trying to reproduce the listener's ground-truth (GT) facial reaction.&lt;/span&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;&lt;/span&gt;&lt;u&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;&lt;/span&gt;&lt;/u&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;u&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;Recent studies:&lt;/span&gt;&lt;/u&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;1) Studies that use 3DMM (3D Morphable Model) coefficients to visualize facial muscle movements&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;2) &lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;Studies that include several of the speaker's modalities in the input, providing verbal and non-verbal cues for generating the listener's facial reaction, &lt;/span&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;and that search for a network optimized for the listener's characteristics&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;color: #1a5490;&quot;&gt;&lt;b&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;Fundamental problems&lt;/span&gt;&lt;/b&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;1) Because facial reactions differ from person to person, and even the same listener can react differently,&lt;/span&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt; training a model to reproduce the listener's GT facial reaction from the speaker's sequence alone is an inherently flawed approach.&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;2) They cannot generate facial reactions for the listener, as opposed to the speaker.&lt;/span&gt;&lt;/p&gt;
&lt;h3 style=&quot;text-align: justify;&quot; data-ke-size=&quot;size23&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;2.2 Non-verbal&amp;nbsp;Human&amp;nbsp;body/gesture&amp;nbsp;Behaviour&amp;nbsp;Generation&lt;/span&gt;&lt;br /&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;&lt;/span&gt;&lt;/h3&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;u&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;Previous work:&lt;/span&gt;&lt;/u&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;Generated motion is mainly represented as 3D skeletons, video frames, or 3D parameters. &lt;/span&gt;&lt;br /&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;&lt;u&gt;Recent work:&lt;/u&gt;&lt;/span&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;Generative models ranging from GANs, VAEs, and Normalizing Flows to, more recently, Diffusion Models have been adopted. &lt;/span&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;&lt;/span&gt;&lt;u&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000; text-align: justify;&quot;&gt;&lt;/span&gt;&lt;/u&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;u&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000; text-align: justify;&quot;&gt;Motion synthesis approaches&lt;/span&gt;&lt;/u&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;1) Unconstrained generation: generates arbitrary, generic motion without any specific condition&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;2) Conditioned generation:&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000; text-align: justify;&quot;&gt; generates motion that fits a given condition&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;Samples generated by both deterministic &lt;/span&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;and non-deterministic generative models &lt;b&gt;&lt;span style=&quot;color: #1a5490;&quot;&gt;often suffer from insufficient diversity.&lt;/span&gt;&lt;/b&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #333333; text-align: start;&quot;&gt;&lt;span style=&quot;color: #000000;&quot;&gt;Insufficient diversity: &lt;/span&gt;&lt;/span&gt;&lt;span style=&quot;color: #000000;&quot;&gt;although&lt;/span&gt; such models can in principle generate varied outputs, in practice they often produce the most average motion in the training data and fail to yield diverse results.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;1179&quot; data-origin-height=&quot;775&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/l7lK0/btsO1Tj5ddG/Bnma8NJ0gVaiveUnkZsHp0/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/l7lK0/btsO1Tj5ddG/Bnma8NJ0gVaiveUnkZsHp0/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/l7lK0/btsO1Tj5ddG/Bnma8NJ0gVaiveUnkZsHp0/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2Fl7lK0%2FbtsO1Tj5ddG%2FBnma8NJ0gVaiveUnkZsHp0%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;446&quot; height=&quot;293&quot; data-origin-width=&quot;1179&quot; data-origin-height=&quot;775&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;h3 style=&quot;text-align: justify;&quot; data-ke-size=&quot;size23&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;2.3 Conditional Generative Models&lt;/span&gt;&lt;br /&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;&lt;/span&gt;&lt;/h3&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;u&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;Recent studies:&lt;/span&gt;&lt;/u&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;These have evolved to integrate diverse modalities so that the generated output matches the condition.&lt;/span&gt;&lt;br /&gt;&lt;u&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;Early approaches: &lt;/span&gt;&lt;/u&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;- Class labels were used to distinguish images and to steer the generated output toward the corresponding attributes.&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;- Class labels serve as the conditioning signal, and the generation process is conditioned either by concatenating them with the input or through conditional normalization.&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;- Conditional diffusion models likewise integrate class information into the normalization layers and guide generation with the gradient of a classifier. &lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;- Classifier-free guidance, which obtains guidance from the generative model itself without a separate classifier, has also been studied.&lt;/span&gt;&lt;/p&gt;
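&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;As a rough sketch of classifier-free guidance, the conditional and unconditional noise predictions of a single diffusion model can be blended as shown here. The function name, the toy vectors, and the scale convention (a guidance_scale of 1.0 returns the plain conditional prediction) are assumptions for illustration; published formulations parameterize the scale differently, and this is not the ReactFace implementation.&lt;/span&gt;&lt;/p&gt;

```python
def classifier_free_guidance(eps_uncond, eps_cond, guidance_scale):
    """Blend unconditional and conditional noise predictions element-wise.

    Assumed convention: eps = eps_uncond + scale * (eps_cond - eps_uncond).
    """
    return [u + guidance_scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


# Hypothetical outputs of the same model without and with the condition.
eps_uncond = [0.0, 1.0, 2.0]
eps_cond = [1.0, 1.0, 0.0]

plain = classifier_free_guidance(eps_uncond, eps_cond, 1.0)   # conditional only
guided = classifier_free_guidance(eps_uncond, eps_cond, 2.0)  # pushed past it

print(plain)
print(guided)
```

&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;With a scale above 1.0 the result moves further along the direction in which the condition changes the prediction, which is how guidance strengthens the condition without a separate classifier.&lt;/span&gt;&lt;/p&gt;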
&lt;h3 style=&quot;text-align: justify;&quot; data-ke-size=&quot;size23&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;2.4 Modality Alignment in Generative Models&lt;/span&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;&lt;/span&gt;&lt;/h3&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;&lt;u&gt;Modality alignment:&lt;/u&gt; &lt;/span&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;enables consistent content generation and manipulation across modalities, but is made difficult by the semantic and dimensional gaps between them.&lt;/span&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;&lt;/span&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000; text-align: justify;&quot;&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;color: #1a5490;&quot;&gt;&lt;b&gt;&lt;u&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; text-align: justify;&quot;&gt;1) Addressing the semantic gap: &lt;/span&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;contrastive learning&lt;/span&gt;&lt;/u&gt;&lt;/b&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;b&gt;&lt;span style=&quot;color: #1a5490;&quot;&gt;&lt;u&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;2) &lt;/span&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;Addressing the dimension mismatch between modalities: cross-modal attention&lt;/span&gt;&lt;/u&gt;&lt;/span&gt;&lt;/b&gt;&lt;br /&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;&lt;/span&gt;&lt;/p&gt;
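&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;To make the contrastive idea concrete, here is a minimal one-directional InfoNCE-style loss. It is a sketch under assumptions, not the method of any specific paper: the two-dimensional embeddings, the names audio, face_aligned and face_swapped, and the temperature of 0.1 are all hypothetical, and both modalities are assumed to be already projected into a shared space. The loss is small when each audio embedding is closest to its own paired face embedding and large when the pairs are mixed up.&lt;/span&gt;&lt;/p&gt;

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def info_nce(anchors, positives, temperature=0.1):
    """Average cross-entropy of matching anchors[i] to positives[i]."""
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [cosine(anchor, p) / temperature for p in positives]
        log_denominator = math.log(sum(math.exp(z) for z in logits))
        total += log_denominator - logits[i]
    return total / len(anchors)


# Hypothetical embeddings, assumed to live in a shared 2-D space already.
audio = [[1.0, 0.0], [0.0, 1.0]]
face_aligned = [[0.9, 0.1], [0.1, 0.9]]   # pair i matches audio i
face_swapped = [[0.1, 0.9], [0.9, 0.1]]   # pairs deliberately mismatched

loss_aligned = info_nce(audio, face_aligned)
loss_swapped = info_nce(audio, face_swapped)
print(loss_aligned, loss_swapped)
```

&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;Minimizing such a loss pulls matching pairs together and pushes mismatched pairs apart, which is the sense in which contrastive learning narrows the semantic gap.&lt;/span&gt;&lt;/p&gt;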
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;In addition, &lt;u&gt;&lt;span style=&quot;color: #1a5490;&quot;&gt;&lt;b&gt;multimodal transformers&lt;/b&gt;&lt;/span&gt;&lt;/u&gt; are also used.&lt;/span&gt;&lt;br /&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;&lt;span style=&quot;color: #1a5490;&quot;&gt;&lt;b&gt;&lt;u&gt;Temporal alignment&lt;/u&gt;&lt;/b&gt;&lt;/span&gt; has also motivated new attention schemes:&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;- ALiBi: Attention with Linear Biases&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;- Biased Cross-Modal Attention&lt;/span&gt;&lt;/p&gt;
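&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;As a reminder of what ALiBi does, here is a minimal pure-Python sketch of its linear bias for causal self-attention. The function names are illustrative, and the slope formula is the one ALiBi uses for power-of-two head counts; it is not taken from the paper reviewed here.&lt;/span&gt;&lt;/p&gt;

```python
def alibi_slopes(num_heads):
    # Geometric sequence used by ALiBi when num_heads is a power of two:
    # slope_h = 2 ** (-8 * h / num_heads) for h = 1..num_heads.
    return [2.0 ** (-8.0 * h / num_heads) for h in range(1, num_heads + 1)]


def alibi_bias(seq_len, slope):
    # bias[i][j] = -slope * (i - j) for keys j at or before query i.
    # Keys after the query are masked with -inf, as in causal attention.
    neg_inf = float("-inf")
    return [
        [-slope * (i - j) if i >= j else neg_inf for j in range(seq_len)]
        for i in range(seq_len)
    ]


slopes = alibi_slopes(8)
bias = alibi_bias(3, slopes[0])  # bias matrix for the first head
```

&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;The bias is added to the attention scores before the softmax, so more distant keys get a larger penalty without any learned positional embedding.&lt;/span&gt;&lt;/p&gt;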
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;color: #781b33;&quot;&gt;&lt;b&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;These approaches still have difficulty modeling the interaction between two people.&lt;/span&gt;&lt;/b&gt;&lt;/span&gt;&lt;/p&gt;
&lt;hr contenteditable=&quot;false&quot; data-ke-type=&quot;horizontalRule&quot; data-ke-style=&quot;style6&quot; /&gt;
&lt;h2 data-ke-size=&quot;size26&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;3 PROBLEM FORMULATION&lt;/span&gt;&lt;/h2&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;&lt;u&gt;Goal of MAFRG:&lt;/u&gt; to develop a model that can generate facial reactions in response to a given speaker sequence.&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;Formally, the problem is defined as follows:&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;774&quot; data-origin-height=&quot;124&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/b1EKgd/btsO0W2TPOX/tUKIFR2aqKuxv0k7KKnNjK/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/b1EKgd/btsO0W2TPOX/tUKIFR2aqKuxv0k7KKnNjK/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/b1EKgd/btsO0W2TPOX/tUKIFR2aqKuxv0k7KKnNjK/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2Fb1EKgd%2FbtsO0W2TPOX%2FtUKIFR2aqKuxv0k7KKnNjK%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;193&quot; height=&quot;31&quot; data-origin-width=&quot;774&quot; data-origin-height=&quot;124&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;hr contenteditable=&quot;false&quot; data-ke-type=&quot;horizontalRule&quot; data-ke-style=&quot;style6&quot; /&gt;
&lt;h2 style=&quot;color: #000000; text-align: start;&quot; data-ke-size=&quot;size26&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;4 METHODOLOGY&lt;/span&gt;&lt;/h2&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;color: #000000; font-family: 'Noto Serif KR';&quot;&gt;Because cognitive processing delays a listener's facial reaction, the paper defines a small time window (w=8).&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;color: #333333; text-align: start;&quot; data-ke-size=&quot;size14&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;color: #000000; font-family: 'Noto Serif KR';&quot;&gt;The speaker's audio-visual behavior can be split into two parts.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;528&quot; data-origin-height=&quot;68&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/zelDe/btsO0x2Zh2h/n51UtvdfYDTvkQfnjVmziK/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/zelDe/btsO0x2Zh2h/n51UtvdfYDTvkQfnjVmziK/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/zelDe/btsO0x2Zh2h/n51UtvdfYDTvkQfnjVmziK/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FzelDe%2FbtsO0x2Zh2h%2Fn51UtvdfYDTvkQfnjVmziK%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;225&quot; height=&quot;29&quot; data-origin-width=&quot;528&quot; data-origin-height=&quot;68&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p style=&quot;color: #333333; text-align: start;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;color: #000000; font-family: 'Noto Serif KR';&quot;&gt;&lt;u&gt;1) The speaker's audio-facial behavior expressed during the previous time period&lt;/u&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;572&quot; data-origin-height=&quot;72&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/ZXtuH/btsO2wIpWK9/Ml5RlkVaNQeWGRKbIntKCk/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/ZXtuH/btsO2wIpWK9/Ml5RlkVaNQeWGRKbIntKCk/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/ZXtuH/btsO2wIpWK9/Ml5RlkVaNQeWGRKbIntKCk/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FZXtuH%2FbtsO2wIpWK9%2FMl5RlkVaNQeWGRKbIntKCk%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;206&quot; height=&quot;26&quot; data-origin-width=&quot;572&quot; data-origin-height=&quot;72&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p style=&quot;color: #333333; text-align: start;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;color: #000000; font-family: 'Noto Serif KR';&quot;&gt;&lt;u&gt;2) The speaker's audio-facial behavior expressed during the current time period&lt;/u&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;684&quot; data-origin-height=&quot;63&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/lcaNV/btsO2E7hGhl/f6Mai9IK6Dlr5Py7Fy3qD0/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/lcaNV/btsO2E7hGhl/f6Mai9IK6Dlr5Py7Fy3qD0/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/lcaNV/btsO2E7hGhl/f6Mai9IK6Dlr5Py7Fy3qD0/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FlcaNV%2FbtsO2E7hGhl%2Ff6Mai9IK6Dlr5Py7Fy3qD0%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;239&quot; height=&quot;22&quot; data-origin-width=&quot;684&quot; data-origin-height=&quot;63&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;It can therefore also be written as:&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;381&quot; data-origin-height=&quot;64&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/zfjuE/btsO1VCcCB9/PuiT0pfnaTIFpasObvXBz1/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/zfjuE/btsO1VCcCB9/PuiT0pfnaTIFpasObvXBz1/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/zfjuE/btsO1VCcCB9/PuiT0pfnaTIFpasObvXBz1/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FzfjuE%2FbtsO1VCcCB9%2FPuiT0pfnaTIFpasObvXBz1%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;113&quot; height=&quot;19&quot; data-origin-width=&quot;381&quot; data-origin-height=&quot;64&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;X: the raw audio signal&lt;br /&gt;F: the 2D face image sequence&lt;/span&gt;&lt;/p&gt;
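&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;The split into a previous part and a current window can be sketched as below. This is only an illustration: it assumes frames are numbered 1..t and that the current window covers frames t-w+1..t with w=8, and the function name and list representation are hypothetical.&lt;/span&gt;&lt;/p&gt;

```python
def split_speaker_behavior(frames, t, w=8):
    """Split frames 1..t into the previous part and the current window.

    frames[0] holds frame 1. The default w=8 follows the post; everything
    else here is an illustrative assumption.
    """
    start = max(0, t - w)       # 0-based index of frame t-w+1
    previous = frames[:start]   # frames 1 .. t-w
    current = frames[start:t]   # frames t-w+1 .. t
    return previous, current


frames = list(range(1, 21))     # stand-in for 20 speaker frames
previous, current = split_speaker_behavior(frames, t=20)
```

&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;The same slicing would apply to both the audio stream X and the face stream F, up to the different sampling rates of the two modalities.&lt;/span&gt;&lt;/p&gt;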
&lt;p style=&quot;color: #333333; text-align: start;&quot; data-ke-size=&quot;size14&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;Based on three pieces of information, the ReactFace model generates a 3D facial reaction segment that shows how the listener reacts at the &lt;u&gt;current&lt;/u&gt; time.&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;1) The previously generated/predicted facial reaction sequence&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;2) The previous speaker behavior&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;830&quot; data-origin-height=&quot;81&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/sdXh2/btsO1UDluMa/N9kDlow504LGziHz0RuCC1/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/sdXh2/btsO1UDluMa/N9kDlow504LGziHz0RuCC1/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/sdXh2/btsO1UDluMa/N9kDlow504LGziHz0RuCC1/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FsdXh2%2FbtsO1UDluMa%2FN9kDlow504LGziHz0RuCC1%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;297&quot; height=&quot;29&quot; data-origin-width=&quot;830&quot; data-origin-height=&quot;81&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p data-end=&quot;1501&quot; data-start=&quot;1476&quot; data-ke-size=&quot;size14&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-end=&quot;1501&quot; data-start=&quot;1476&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;The following subsections cover:&lt;/span&gt;&lt;/p&gt;
&lt;p data-end=&quot;1501&quot; data-start=&quot;1476&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;4.1&amp;nbsp;Overview of the full ReactFace framework&lt;/span&gt;&lt;/p&gt;
&lt;p data-end=&quot;1501&quot; data-start=&quot;1476&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;4.2 Strategy for generating multiple facial reactions&lt;/span&gt;&lt;/p&gt;
&lt;p data-end=&quot;1501&quot; data-start=&quot;1476&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;4.3 Speaker-listener behavior synchronization module&lt;/span&gt;&lt;/p&gt;
&lt;h3 data-ke-size=&quot;size23&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;4.1&amp;nbsp;The&amp;nbsp;ReactFace&amp;nbsp;Framework&lt;/span&gt;&lt;/h3&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;The full ReactFace framework consists of four main modules.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;1176&quot; data-origin-height=&quot;512&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/AMTai/btsO2EGfA7i/Rvkst4LDXjhUX1Ezs20C7K/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/AMTai/btsO2EGfA7i/Rvkst4LDXjhUX1Ezs20C7K/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/AMTai/btsO2EGfA7i/Rvkst4LDXjhUX1Ezs20C7K/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FAMTai%2FbtsO2EGfA7i%2FRvkst4LDXjhUX1Ezs20C7K%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;1176&quot; height=&quot;512&quot; data-origin-width=&quot;1176&quot; data-origin-height=&quot;512&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;br /&gt;&lt;span style=&quot;color: #000000;&quot;&gt;&lt;u&gt;&lt;b&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;1) MSBEA: Multi-modal Speaker Behavior Encoding and Alignment&lt;/span&gt;&lt;/b&gt;&lt;/u&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;545&quot; data-origin-height=&quot;638&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/AVHGv/btsO1GyDqQ1/78A7qfulgTWfiwKUrxZcEk/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/AVHGv/btsO1GyDqQ1/78A7qfulgTWfiwKUrxZcEk/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/AVHGv/btsO1GyDqQ1/78A7qfulgTWfiwKUrxZcEk/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FAVHGv%2FbtsO1GyDqQ1%2F78A7qfulgTWfiwKUrxZcEk%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;392&quot; height=&quot;459&quot; data-origin-width=&quot;545&quot; data-origin-height=&quot;638&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;This module takes the speaker's audio and facial expression video as input and encodes them into embeddings.&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;&lt;u&gt;SE:&lt;/u&gt; uses a pretrained wav2vec 2.0 model.&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;k is the sampling ratio used when extracting speech frames: speech frames are sampled k times more often than face frames (the value depends on the speech encoder configuration).&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;&lt;u&gt;2D FSE:&lt;/u&gt; consists of a 3D convolution layer and a transformer layer.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;407&quot; data-origin-height=&quot;143&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/d5SqPo/btsO1FGtU6s/YvtpkkIdz0WXS8R8jU0xLK/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/d5SqPo/btsO1FGtU6s/YvtpkkIdz0WXS8R8jU0xLK/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/d5SqPo/btsO1FGtU6s/YvtpkkIdz0WXS8R8jU0xLK/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2Fd5SqPo%2FbtsO1FGtU6s%2FYvtpkkIdz0WXS8R8jU0xLK%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;134&quot; height=&quot;47&quot; data-origin-width=&quot;407&quot; data-origin-height=&quot;143&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;color: #000000; font-family: 'Noto Serif KR';&quot;&gt;A modality alignment module is then applied.&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;color: #000000; font-family: 'Noto Serif KR';&quot;&gt;&lt;u&gt;PMA:&lt;/u&gt; a cross-attention-based transformer with an alignment bias, which aligns the embeddings of the speaker's two modalities in a point-to-point manner.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;585&quot; data-origin-height=&quot;569&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/bh8U3N/btsO11IQL6d/LnOPr6xWhqeK5EXH0s4I2k/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/bh8U3N/btsO11IQL6d/LnOPr6xWhqeK5EXH0s4I2k/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/bh8U3N/btsO11IQL6d/LnOPr6xWhqeK5EXH0s4I2k/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2Fbh8U3N%2FbtsO11IQL6d%2FLnOPr6xWhqeK5EXH0s4I2k%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;400&quot; height=&quot;389&quot; data-origin-width=&quot;585&quot; data-origin-height=&quot;569&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;497&quot; data-origin-height=&quot;77&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/bsSB0l/btsO2Wmppkg/CWnkBrEy6iBdSzeaRE1tT1/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/bsSB0l/btsO2Wmppkg/CWnkBrEy6iBdSzeaRE1tT1/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/bsSB0l/btsO2Wmppkg/CWnkBrEy6iBdSzeaRE1tT1/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FbsSB0l%2FbtsO2Wmppkg%2FCWnkBrEy6iBdSzeaRE1tT1%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;219&quot; height=&quot;34&quot; data-origin-width=&quot;497&quot; data-origin-height=&quot;77&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
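&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;color: #000000; font-family: 'Noto Serif KR';&quot;&gt;To make the idea of an alignment bias concrete, here is a rough pure-Python sketch of cross-attention weights with a distance-based bias. The bias form is an assumption made for illustration (face frame i is taken to line up with speech frames i*k to i*k+k-1) and is not the formula from the paper; all function names are hypothetical.&lt;/span&gt;&lt;/p&gt;

```python
import math


def alignment_bias(num_face, k, slope=1.0):
    # Hypothetical bias: the penalty grows with the distance, measured in
    # face-frame units, between face frame i and speech frame j.
    num_speech = num_face * k
    return [
        [-slope * abs(i - j // k) for j in range(num_speech)]
        for i in range(num_face)
    ]


def softmax(row):
    # Numerically stable softmax over one row of scores.
    top = max(row)
    exps = [math.exp(v - top) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def biased_attention_weights(scores, bias):
    # scores[i][j] is the raw query-key similarity; the bias is added
    # before the softmax so temporally aligned frames are favoured.
    return [
        softmax([s + b for s, b in zip(score_row, bias_row)])
        for score_row, bias_row in zip(scores, bias)
    ]


bias = alignment_bias(num_face=3, k=2)
flat_scores = [[0.0] * 6 for _ in range(3)]  # uninformative content scores
weights = biased_attention_weights(flat_scores, bias)
```

&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;color: #000000; font-family: 'Noto Serif KR';&quot;&gt;With flat content scores, each face frame attends most strongly to the k speech frames that share its time step, which is the point-to-point behavior described above.&lt;/span&gt;&lt;/p&gt;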
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;br /&gt;&lt;span style=&quot;color: #000000;&quot;&gt;&lt;u&gt;&lt;b&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;2) AFRG: Appropriate Facial Reaction Generation&lt;/span&gt;&lt;/b&gt;&lt;/u&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;544&quot; data-origin-height=&quot;701&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/kVlpo/btsO0G7bhVm/6GSGiCG1vS1RYDESlDo84k/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/kVlpo/btsO0G7bhVm/6GSGiCG1vS1RYDESlDo84k/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/kVlpo/btsO0G7bhVm/6GSGiCG1vS1RYDESlDo84k/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FkVlpo%2FbtsO0G7bhVm%2F6GSGiCG1vS1RYDESlDo84k%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;413&quot; height=&quot;532&quot; data-origin-width=&quot;544&quot; data-origin-height=&quot;701&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;br /&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;The Conditional Interaction Encoder (CIE) predicts a distribution that represents multiple possible facial reaction segments for the current window [t&amp;minus;w+1, t].&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;This distribution is predicted from three inputs.&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;u&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;- The speaker's audio and video (frames)&lt;/span&gt;&lt;/u&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;u&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;- The listener's previous reactions&lt;/span&gt;&lt;/u&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;u&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;- Two learnable tokens&lt;/span&gt;&lt;/u&gt;&lt;/p&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;775&quot; data-origin-height=&quot;164&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/yHMQS/btsO1fOQaTs/kF53lckgnaBsVBMBlm2eR0/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/yHMQS/btsO1fOQaTs/kF53lckgnaBsVBMBlm2eR0/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/yHMQS/btsO1fOQaTs/kF53lckgnaBsVBMBlm2eR0/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FyHMQS%2FbtsO1fOQaTs%2FkF53lckgnaBsVBMBlm2eR0%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;246&quot; height=&quot;52&quot; data-origin-width=&quot;775&quot; data-origin-height=&quot;164&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000; text-align: start;&quot;&gt;The Listener Reaction Decoder (LRD) then takes the sampled z and decodes it into a 3D facial reaction.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;753&quot; data-origin-height=&quot;76&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/bjdFn3/btsO1iENFHu/WvjYlndhIxQ9KXaFqrXH7K/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/bjdFn3/btsO1iENFHu/WvjYlndhIxQ9KXaFqrXH7K/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/bjdFn3/btsO1iENFHu/WvjYlndhIxQ9KXaFqrXH7K/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FbjdFn3%2FbtsO1iENFHu%2FWvjYlndhIxQ9KXaFqrXH7K%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;337&quot; height=&quot;34&quot; data-origin-width=&quot;753&quot; data-origin-height=&quot;76&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;color: #000000;&quot;&gt;&lt;u&gt;&lt;b&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;3) SLBS: Speaker-Listener Behaviour Synchronisation&lt;/span&gt;&lt;/b&gt;&lt;/u&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;602&quot; data-origin-height=&quot;747&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/dQCXGM/btsO1fOQQJm/pZDDkyZYEaN2tRx0om0Sk1/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/dQCXGM/btsO1fOQQJm/pZDDkyZYEaN2tRx0om0Sk1/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/dQCXGM/btsO1fOQQJm/pZDDkyZYEaN2tRx0om0Sk1/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FdQCXGM%2FbtsO1fOQQJm%2FpZDDkyZYEaN2tRx0om0Sk1%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;281&quot; height=&quot;349&quot; data-origin-width=&quot;602&quot; data-origin-height=&quot;747&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p data-pm-slice=&quot;0 0 []&quot; data-ke-size=&quot;size14&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-pm-slice=&quot;0 0 []&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;The SLBS module synchronizes the generated facial reaction with the speaker's current behavior at every frame.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;1024&quot; data-origin-height=&quot;78&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/cFIliL/btsO0TFkvCu/f47UZRDTdkmcyOAm4FKSJK/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/cFIliL/btsO0TFkvCu/f47UZRDTdkmcyOAm4FKSJK/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/cFIliL/btsO0TFkvCu/f47UZRDTdkmcyOAm4FKSJK/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FcFIliL%2FbtsO0TFkvCu%2Ff47UZRDTdkmcyOAm4FKSJK%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;417&quot; height=&quot;32&quot; data-origin-width=&quot;1024&quot; data-origin-height=&quot;78&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-pm-slice=&quot;0 0 []&quot; data-ke-size=&quot;size14&quot;&gt;&lt;br /&gt;&lt;span style=&quot;color: #000000;&quot;&gt;&lt;u&gt;&lt;b&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;4) FRV: Facial Reaction Visualisation&lt;/span&gt;&lt;/b&gt;&lt;/u&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;To convert the generated 3D reaction into 2D face frames, PIRender is retrained and adapted to the FaceVerse 3DMM.&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-pm-slice=&quot;0 0 []&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;The rendering network takes the generated 3DMM coefficients and a reference portrait of a specific listener's face as input, and outputs a 2D image sequence showing the listener's facial reaction.&lt;/span&gt;&lt;/p&gt;
&lt;h3 data-ke-size=&quot;size23&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;4.2&amp;nbsp;Appropriate&amp;nbsp;Facial&amp;nbsp;Reactions&amp;nbsp;Generation&lt;/span&gt;&lt;/h3&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;The AFRG module consists of three blocks.&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;1) CIE&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;2) Sampling Block&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;3) LRD&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;color: #000000; font-family: 'Noto Serif KR';&quot;&gt;Looking at each block in turn:&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;b&gt;&lt;span style=&quot;color: #000000;&quot;&gt;&lt;u&gt; &lt;span style=&quot;font-family: 'Noto Serif KR'; text-align: start;&quot;&gt;1) Conditional Interaction Encoder (CIE)&lt;/span&gt;&lt;/u&gt;&lt;/span&gt;&lt;/b&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;A variational encoder made up of three transformer encoder layers.&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;color: #000000;&quot;&gt;&lt;u&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;Input: &lt;/span&gt;&lt;/u&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;- The speaker's synchronized previous speech and face embeddings&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;- Two learnable tokens&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;u&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;Output:&lt;/span&gt;&lt;/u&gt;&lt;/p&gt;
&lt;p style=&quot;color: #333333; text-align: start;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;- Mean vector&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;color: #333333; text-align: start;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;- Standard deviation vector&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;color: #333333; text-align: start;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;Together they define a normal distribution describing the appropriate facial reactions to the speaker's behavior.&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;color: #333333; text-align: start;&quot; data-ke-size=&quot;size14&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p style=&quot;color: #333333; text-align: start;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;&lt;b&gt;&lt;u&gt; &lt;span style=&quot;color: #000000; text-align: start;&quot;&gt;2) Sampling Block&lt;/span&gt;&lt;/u&gt;&lt;/b&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;color: #000000; font-family: 'Noto Serif KR';&quot;&gt;A single latent embedding corresponding to frame t is sampled.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;560&quot; data-origin-height=&quot;59&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/bFyMk5/btsO2B30qBg/tQSwLipHn8YqKc3nV7LoeK/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/bFyMk5/btsO2B30qBg/tQSwLipHn8YqKc3nV7LoeK/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/bFyMk5/btsO2B30qBg/tQSwLipHn8YqKc3nV7LoeK/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FbFyMk5%2FbtsO2B30qBg%2FtQSwLipHn8YqKc3nV7LoeK%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;238&quot; height=&quot;25&quot; data-origin-width=&quot;560&quot; data-origin-height=&quot;59&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-pm-slice=&quot;0 0 []&quot; data-ke-size=&quot;size14&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-pm-slice=&quot;0 0 []&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;Linear interpolation is then performed to obtain a sequence of w latent vectors covering the time window.&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-pm-slice=&quot;0 0 []&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;In other words, only one latent is sampled and the rest are filled in by interpolation, expanding it to 8 frames.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;1044&quot; data-origin-height=&quot;57&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/zAvIu/btsO2C9Fh7b/pVWsRii5k5rDZUeLnnM52K/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/zAvIu/btsO2C9Fh7b/pVWsRii5k5rDZUeLnnM52K/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/zAvIu/btsO2C9Fh7b/pVWsRii5k5rDZUeLnnM52K/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FzAvIu%2FbtsO2C9Fh7b%2FpVWsRii5k5rDZUeLnnM52K%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;494&quot; height=&quot;27&quot; data-origin-width=&quot;1044&quot; data-origin-height=&quot;57&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
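&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;The interpolation step can be sketched roughly as follows. This is only a minimal illustration: the function name interpolate_latents, the choice of endpoints (the previous window's embedding and the newly sampled one) and the toy 2-D vectors are assumptions, not details taken from the paper.&lt;/span&gt;&lt;/p&gt;

```python
def interpolate_latents(z_prev, z_curr, w):
    """Linearly interpolate from z_prev towards z_curr in w steps.

    z_prev and z_curr are plain lists of floats with the same length.
    The returned list has w vectors and ends exactly at z_curr; z_prev
    itself is left out because it is assumed to close the previous window.
    """
    if len(z_prev) != len(z_curr):
        raise ValueError("latent vectors must have the same dimension")
    sequence = []
    for step in range(1, w + 1):
        alpha = step / w  # interpolation weight, reaches 1.0 at the last step
        sequence.append([(1.0 - alpha) * a + alpha * b
                         for a, b in zip(z_prev, z_curr)])
    return sequence


# Hypothetical 2-D latents, expanded to a window of w = 8 vectors.
window = interpolate_latents([0.0, 0.0], [8.0, -4.0], w=8)
```

&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;With w = 8, one sampled embedding yields eight latent vectors, and the last one coincides with the sampled embedding.&lt;/span&gt;&lt;/p&gt;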
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;All listener facial reaction frames for the current window are generated from the previously predicted (t&amp;minus;w)-th facial reaction frame, &lt;/span&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;which yields a facial reaction sequence that stays continuous with the preceding frames.&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;u&gt;&lt;b&gt; &lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000; text-align: start;&quot;&gt;3) LRD: Listener Reaction Decoder&lt;/span&gt;&lt;/b&gt;&lt;/u&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;key and value: positional encoding sequence&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;query: embedding sequence. Cross-attention is then performed between them.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;1065&quot; data-origin-height=&quot;144&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/JVEym/btsO2UoM7He/tLub4bFN7skkwvGW5dTgMk/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/JVEym/btsO2UoM7He/tLub4bFN7skkwvGW5dTgMk/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/JVEym/btsO2UoM7He/tLub4bFN7skkwvGW5dTgMk/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FJVEym%2FbtsO2UoM7He%2FtLub4bFN7skkwvGW5dTgMk%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;371&quot; height=&quot;50&quot; data-origin-width=&quot;1065&quot; data-origin-height=&quot;144&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #333333; text-align: justify;&quot;&gt;The positional encoding is computed with sine- and cosine-based functions.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;575&quot; data-origin-height=&quot;164&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/ufmZ2/btsO2whDUL8/NuxKue8reQEIyF1Vs1qmC0/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/ufmZ2/btsO2whDUL8/NuxKue8reQEIyF1Vs1qmC0/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/ufmZ2/btsO2whDUL8/NuxKue8reQEIyF1Vs1qmC0/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FufmZ2%2FbtsO2whDUL8%2FNuxKue8reQEIyF1Vs1qmC0%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;252&quot; height=&quot;72&quot; data-origin-width=&quot;575&quot; data-origin-height=&quot;164&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
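&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;As a rough illustration, a sine/cosine positional encoding can be computed as below. The sketch assumes the standard Transformer form with base 10000 and interleaved sin/cos channels; the function name and the sizes are hypothetical, and the exact variant used in the paper may differ.&lt;/span&gt;&lt;/p&gt;

```python
import math


def sinusoidal_positional_encoding(num_positions, dim, base=10000.0):
    """Return a num_positions x dim table of sin/cos positional encodings.

    Even channels hold sin(pos / base^(2i/dim)) and odd channels hold the
    matching cos term (standard Transformer layout, assumed here).
    """
    if dim % 2 != 0:
        raise ValueError("dim must be even")
    table = []
    for pos in range(num_positions):
        row = []
        for pair in range(dim // 2):
            angle = pos / (base ** (2 * pair / dim))
            row.append(math.sin(angle))  # even channel
            row.append(math.cos(angle))  # odd channel
        table.append(row)
    return table


# Hypothetical sizes: 8 frames in a window, 6 encoding channels.
pe = sinusoidal_positional_encoding(num_positions=8, dim=6)
```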
&lt;h3 data-ke-size=&quot;size23&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;4.3&amp;nbsp;Speaker-listener&amp;nbsp;Behaviour&amp;nbsp;Synchronisation&lt;/span&gt;&lt;/h3&gt;
&lt;blockquote data-end=&quot;262&quot; data-start=&quot;94&quot; data-ke-style=&quot;style2&quot;&gt;&lt;b&gt;Why listener-speaker synchronisation is needed&lt;/b&gt;&lt;/blockquote&gt;
&lt;p data-end=&quot;262&quot; data-start=&quot;94&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;Human facial reactions are spatio-temporal signals that are closely tied to the speaker's behaviour, especially along the time axis. &lt;/span&gt;&lt;/p&gt;
&lt;p data-end=&quot;262&quot; data-start=&quot;94&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;Properly synchronising the generated facial reactions with the corresponding speaker behaviour is therefore &lt;/span&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;key to making the reactions look appropriate and realistic. &lt;/span&gt;&lt;/p&gt;
&lt;blockquote data-end=&quot;262&quot; data-start=&quot;94&quot; data-ke-style=&quot;style2&quot;&gt;&lt;b&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;SLBS(Speaker-Listener Behaviour Synchronisation)&amp;nbsp;&lt;/span&gt;&lt;/b&gt;&lt;/blockquote&gt;
&lt;p data-end=&quot;262&quot; data-start=&quot;94&quot; data-ke-size=&quot;size14&quot;&gt;&lt;u&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;1) Visual&amp;nbsp;Interaction&amp;nbsp;Model&amp;nbsp;(VIM)&lt;/span&gt;&lt;/u&gt;&lt;/p&gt;
&lt;p data-end=&quot;262&quot; data-start=&quot;94&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;color: #000000;&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;Synchronises the 3D facial reaction embedding with the aligned speaker facial embedding.&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;753&quot; data-origin-height=&quot;78&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/Ov6VX/btsO2WmBQxf/kuNbZ5TFBer0c8SERH1En1/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/Ov6VX/btsO2WmBQxf/kuNbZ5TFBer0c8SERH1En1/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/Ov6VX/btsO2WmBQxf/kuNbZ5TFBer0c8SERH1En1/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FOv6VX%2FbtsO2WmBQxf%2FkuNbZ5TFBer0c8SERH1En1%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;318&quot; height=&quot;33&quot; data-origin-width=&quot;753&quot; data-origin-height=&quot;78&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p data-end=&quot;262&quot; data-start=&quot;94&quot; data-ke-size=&quot;size14&quot;&gt;&lt;u&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;2) Modality Interaction Model (MIM)&lt;/span&gt;&lt;/u&gt;&lt;/p&gt;
&lt;p data-end=&quot;262&quot; data-start=&quot;94&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;Further synchronises the visually synchronised facial reaction with the speaker audio embedding at the same time step.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;835&quot; data-origin-height=&quot;87&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/BH7M4/btsO1STmva3/cXUeA73u94T6QX5W6h59XK/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/BH7M4/btsO1STmva3/cXUeA73u94T6QX5W6h59XK/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/BH7M4/btsO1STmva3/cXUeA73u94T6QX5W6h59XK/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FBH7M4%2FbtsO1STmva3%2FcXUeA73u94T6QX5W6h59XK%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;365&quot; height=&quot;38&quot; data-origin-width=&quot;835&quot; data-origin-height=&quot;87&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p data-end=&quot;262&quot; data-start=&quot;94&quot; data-ke-size=&quot;size14&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-end=&quot;262&quot; data-start=&quot;94&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;&lt;span style=&quot;color: #000000;&quot;&gt;As a result, the generated facial reaction frames are synchronised with both the speaker's non-verbal facial expressions and verbal (audio) behaviour.&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p data-end=&quot;262&quot; data-start=&quot;94&quot; data-ke-size=&quot;size14&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-end=&quot;262&quot; data-start=&quot;94&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;In the paper, both VIM and MIM are implemented as cross-attention operations, and a new alignment bias is introduced.&lt;/span&gt;&lt;/p&gt;
&lt;p data-end=&quot;262&quot; data-start=&quot;94&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000; text-align: start;&quot;&gt;The alignment bias rests on the assumption that speaker behaviour closer in time to step i has a larger influence on the facial reaction generated at that step. It:&lt;/span&gt;&lt;/p&gt;
&lt;p data-end=&quot;262&quot; data-start=&quot;94&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000; text-align: start;&quot;&gt;1) encourages the generated facial reaction to stay synchronised along the time axis with the speaker behaviour at the same step, and&lt;/span&gt;&lt;/p&gt;
&lt;p data-end=&quot;262&quot; data-start=&quot;94&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000; text-align: start;&quot;&gt;2) ensures that each facial reaction frame only uses information up to the current time step.&lt;/span&gt;&lt;/p&gt;
&lt;p data-end=&quot;262&quot; data-start=&quot;94&quot; data-ke-size=&quot;size14&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-end=&quot;262&quot; data-start=&quot;94&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000; text-align: start;&quot;&gt;The cross-attention &lt;span style=&quot;font-family: 'Noto Serif KR'; text-align: start;&quot;&gt;with the alignment bias is as follows.&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;1037&quot; data-origin-height=&quot;126&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/45kuX/btsO2Dgxr42/RcqO8AZojksN2XEAKeKKUk/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/45kuX/btsO2Dgxr42/RcqO8AZojksN2XEAKeKKUk/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/45kuX/btsO2Dgxr42/RcqO8AZojksN2XEAKeKKUk/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2F45kuX%2FbtsO2Dgxr42%2FRcqO8AZojksN2XEAKeKKUk%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;439&quot; height=&quot;53&quot; data-origin-width=&quot;1037&quot; data-origin-height=&quot;126&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;884&quot; data-origin-height=&quot;403&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/RUGY1/btsO1STmTax/VhXxbKuKo9apScvqmK835K/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/RUGY1/btsO1STmTax/VhXxbKuKo9apScvqmK835K/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/RUGY1/btsO1STmTax/VhXxbKuKo9apScvqmK835K/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FRUGY1%2FbtsO1STmTax%2FVhXxbKuKo9apScvqmK835K%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;301&quot; height=&quot;137&quot; data-origin-width=&quot;884&quot; data-origin-height=&quot;403&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p data-pm-slice=&quot;0 0 []&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;i&lt;/span&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;: index of the facial reaction frame&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;j&lt;/span&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;: index of the speaker face frame (VIM) or speaker audio frame (MIM)&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;k&lt;/span&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;: sampling ratio, meaning audio frames are sampled &lt;/span&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;k&lt;/span&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt; times more often than face frames&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;p&lt;/span&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;: window size used to split all frames into units&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;1. Compute the temporal distance between the current reaction (i) and the past information (j).&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;2. Divide by p (set to 8 in the paper).&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;3. Multiply by k: this puts the two streams with different rates (face and audio) on the same time axis.&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-end=&quot;868&quot; data-start=&quot;837&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;color: #000000; font-family: 'Noto Serif KR';&quot;&gt;- &lt;/span&gt;&lt;span style=&quot;color: #000000; font-family: 'Noto Serif KR';&quot;&gt;Reflects the natural property of human behaviour that frames closer in time show more similar behaviour patterns&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;color: #000000; font-family: 'Noto Serif KR';&quot;&gt;- Keeps the attention on distant past frames from becoming excessively low, &lt;/span&gt;&lt;span style=&quot;color: #000000; font-family: 'Noto Serif KR';&quot;&gt;so that long-range context is still taken into account&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-end=&quot;1117&quot; data-start=&quot;1023&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;color: #000000; font-family: 'Noto Serif KR';&quot;&gt;In the attention operation, frames within the same unit are treated identically, and the bias takes negative values that decrease gradually as the temporal gap grows. &lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-end=&quot;1117&quot; data-start=&quot;1023&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;color: #000000; font-family: 'Noto Serif KR';&quot;&gt;In addition, the upper triangle &lt;/span&gt;&lt;span style=&quot;color: #000000; font-family: 'Noto Serif KR';&quot;&gt;of this bias matrix &lt;/span&gt;&lt;span style=&quot;color: #000000; font-family: 'Noto Serif KR';&quot;&gt;is set entirely to (&amp;minus;&amp;infin;). &lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-end=&quot;1117&quot; data-start=&quot;1023&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;color: #000000; font-family: 'Noto Serif KR';&quot;&gt;As a result, each facial reaction frame attends only to current or past speaker behaviour.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;783&quot; data-origin-height=&quot;634&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/UX5hv/btsO0VDfLVP/t6QU1qGPp6DzTuUnmzoqaK/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/UX5hv/btsO0VDfLVP/t6QU1qGPp6DzTuUnmzoqaK/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/UX5hv/btsO0VDfLVP/t6QU1qGPp6DzTuUnmzoqaK/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FUX5hv%2FbtsO0VDfLVP%2Ft6QU1qGPp6DzTuUnmzoqaK%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;497&quot; height=&quot;402&quot; data-origin-width=&quot;783&quot; data-origin-height=&quot;634&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
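&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;color: #000000; font-family: 'Noto Serif KR';&quot;&gt;The behaviour described above (same bias inside a unit, more negative bias for older units, masked future) can be sketched as follows. This is a simplified reading, not the paper's exact formula: it covers only the equal-frame-rate case, so the ratio k is left out, and the negative floor form, the function names and the toy sizes are all assumptions.&lt;/span&gt;&lt;/p&gt;

```python
import math

NEG_INF = float("-inf")


def alignment_bias(num_queries, num_keys, p):
    """Build a causal, unit-quantised temporal bias matrix.

    Row i is a facial reaction frame, column j a speaker frame.
    Future columns (j after i) are masked with -inf; past columns get
    -floor((i - j) / p), so frames in the same unit share one value.
    """
    bias = []
    for i in range(num_queries):
        row = []
        for j in range(num_keys):
            if j > i:
                row.append(NEG_INF)  # future speaker frame: masked out
            else:
                row.append(float(-((i - j) // p)))  # older unit, lower bias
        bias.append(row)
    return bias


def biased_softmax(logits, bias_row):
    """Softmax over attention logits after adding one row of the bias."""
    scores = [logit + b for logit, b in zip(logits, bias_row)]
    peak = max(scores)  # finite, because column 0 is never masked
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Toy example: 4 reaction frames, 4 speaker frames, unit size p = 2.
bias = alignment_bias(4, 4, p=2)
weights = biased_softmax([0.0, 0.0, 0.0, 0.0], bias[1])
```

&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;color: #000000; font-family: 'Noto Serif KR';&quot;&gt;Adding -inf before the softmax gives the masked positions exactly zero attention weight, which is how the upper triangle enforces causality.&lt;/span&gt;&lt;/p&gt;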
&lt;h3 style=&quot;text-align: justify;&quot; data-ke-size=&quot;size23&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;4.4&amp;nbsp;Loss&amp;nbsp;functions&amp;nbsp;and&amp;nbsp;training&amp;nbsp;strategy&lt;/span&gt;&lt;/h3&gt;
&lt;h4 style=&quot;text-align: justify;&quot; data-ke-size=&quot;size20&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;4.4.1&amp;nbsp;Training&amp;nbsp;Strategy&lt;/span&gt;&lt;/h4&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;color: #000000; font-family: 'Noto Serif KR';&quot;&gt;ReactFace is trained end-to-end by jointly optimising the following five loss functions:&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;916&quot; data-origin-height=&quot;62&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/bB99i8/btsO0WhNy6M/SLTrTlZqekCEpcZcbKaKsk/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/bB99i8/btsO0WhNy6M/SLTrTlZqekCEpcZcbKaKsk/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/bB99i8/btsO0WhNy6M/SLTrTlZqekCEpcZcbKaKsk/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FbB99i8%2FbtsO0WhNy6M%2FSLTrTlZqekCEpcZcbKaKsk%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;460&quot; height=&quot;31&quot; data-origin-width=&quot;916&quot; data-origin-height=&quot;62&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;1) Reaction generation loss&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;2) Speaker face reconstruction loss&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;3) KL divergence loss&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;4) Temporal smoothness loss&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;5) Diversity loss&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;630&quot; data-origin-height=&quot;606&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/HideJ/btsO0VcautX/tNpYxbsKfyZ2J4keAGKv6K/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/HideJ/btsO0VcautX/tNpYxbsKfyZ2J4keAGKv6K/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/HideJ/btsO0VcautX/tNpYxbsKfyZ2J4keAGKv6K/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FHideJ%2FbtsO0VcautX%2FtNpYxbsKfyZ2J4keAGKv6K%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;348&quot; height=&quot;335&quot; data-origin-width=&quot;630&quot; data-origin-height=&quot;606&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;h4 style=&quot;text-align: justify;&quot; data-ke-size=&quot;size20&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;4.4.2&amp;nbsp;Loss&amp;nbsp;Functions&lt;/span&gt;&lt;/h4&gt;
&lt;p style=&quot;color: #333333; text-align: start;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;Speaker face reconstruction loss: measures how accurately MSBEA reconstructs the input speaker face.&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;color: #333333; text-align: start;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;It uses an L1 loss on the difference between the original speaker face and the speaker face reconstructed by the model.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;632&quot; data-origin-height=&quot;99&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/pUpsa/btsO5cYY3TE/TCTVIXq5kH5ZrK9svBrFok/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/pUpsa/btsO5cYY3TE/TCTVIXq5kH5ZrK9svBrFok/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/pUpsa/btsO5cYY3TE/TCTVIXq5kH5ZrK9svBrFok/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FpUpsa%2FbtsO5cYY3TE%2FTCTVIXq5kH5ZrK9svBrFok%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;166&quot; height=&quot;26&quot; data-origin-width=&quot;632&quot; data-origin-height=&quot;99&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p style=&quot;color: #333333; text-align: start;&quot; data-ke-size=&quot;size14&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p style=&quot;color: #333333; text-align: start;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;Reaction generation loss: to check whether the model has generated a good reaction for a given situation, &lt;/span&gt;&lt;span style=&quot;color: #000000;&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;the generated reaction is compared against every member of a predefined set of real reactions.&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;color: #333333; text-align: start;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;The reaction in that set that is closest (min) to the generated one is selected, and the error is computed against it.&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;color: #333333; text-align: start;&quot; data-ke-size=&quot;size14&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;1047&quot; data-origin-height=&quot;78&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/c3IeS8/btsO6eVM5Cc/UkRQZjAgjKzkuqKb5fqVdK/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/c3IeS8/btsO6eVM5Cc/UkRQZjAgjKzkuqKb5fqVdK/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/c3IeS8/btsO6eVM5Cc/UkRQZjAgjKzkuqKb5fqVdK/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2Fc3IeS8%2FbtsO6eVM5Cc%2FUkRQZjAgjKzkuqKb5fqVdK%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;336&quot; height=&quot;25&quot; data-origin-width=&quot;1047&quot; data-origin-height=&quot;78&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
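&lt;p style=&quot;color: #333333; text-align: start;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;A minimal sketch of this "compare with all, keep the closest" idea is shown below. The mean absolute error used as the distance, the function names and the toy vectors are assumptions for illustration; the paper's actual error term is the one in the equation above.&lt;/span&gt;&lt;/p&gt;

```python
def mean_abs_error(a, b):
    """Mean absolute difference between two equally long vectors."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def reaction_generation_loss(generated, real_reactions):
    """Error against the closest member of the set of real reactions.

    Taking the minimum means the model is only penalised for its distance
    to the best-matching real reaction, not to every valid reaction.
    """
    return min(mean_abs_error(generated, real) for real in real_reactions)


# Hypothetical 2-D reaction features and two valid real reactions.
loss = reaction_generation_loss([0.0, 1.0], [[1.0, 1.0], [0.0, 0.5]])
```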
&lt;p style=&quot;color: #333333; text-align: start;&quot; data-ke-size=&quot;size14&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;Diversity loss: for one and the same situation, the model generates M different reactions (3 during training). &lt;/span&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;The M generated reactions are paired up, and for every pair (i and j) a distance-based similarity score is computed.&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;If two reactions are very similar &amp;rarr; the score approaches 1.&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;If two reactions are very different &amp;rarr; the score approaches 0.&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;Penalty: the sum of these similarity scores.&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;That is, the more similar the generated reactions are to one another, the larger the loss, and the more they differ, the smaller the loss.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;1050&quot; data-origin-height=&quot;167&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/AzmgZ/btsO4SfqTMX/Avhp74q1pqyFMkTzULyad0/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/AzmgZ/btsO4SfqTMX/Avhp74q1pqyFMkTzULyad0/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/AzmgZ/btsO4SfqTMX/Avhp74q1pqyFMkTzULyad0/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FAzmgZ%2FbtsO4SfqTMX%2FAvhp74q1pqyFMkTzULyad0%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;314&quot; height=&quot;50&quot; data-origin-width=&quot;1050&quot; data-origin-height=&quot;167&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
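&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;The pairwise penalty can be illustrated with the sketch below. It assumes an exponential kernel exp(-squared distance / scale) as the similarity score, which gives 1 for identical reactions and tends to 0 for distant ones; the kernel, the scale parameter, the use of unordered pairs and the function name are assumptions rather than the paper's definition.&lt;/span&gt;&lt;/p&gt;

```python
import math
from itertools import combinations


def diversity_loss(reactions, scale=1.0):
    """Sum of pairwise similarity scores over all unordered pairs.

    Each score is exp(-squared_distance / scale): 1.0 for identical
    reactions, close to 0.0 for very different ones (assumed kernel).
    """
    total = 0.0
    for a, b in combinations(reactions, 2):
        sq_dist = sum((x - y) ** 2 for x, y in zip(a, b))
        total += math.exp(-sq_dist / scale)
    return total


# M = 3 hypothetical reactions, as in training.
collapsed = diversity_loss([[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]])
diverse = diversity_loss([[0.0, 0.0], [5.0, 0.0], [0.0, 5.0]])
```

&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;Three identical reactions give the maximum penalty of 3 (one per pair), while well-separated reactions give a value near 0.&lt;/span&gt;&lt;/p&gt;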
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;The KL divergence loss is an essential component for stable training of VAE (Variational Autoencoder) style models.&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;Role: it regularises the latent-space distribution N(&amp;mu;, &amp;sigma;) learned by the model so that it does not drift arbitrarily and stays close to the standard normal distribution N(0, I).&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;434&quot; data-origin-height=&quot;97&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/cW6E30/btsO681uyOE/X126hs9ixs56UiDD4SnXM0/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/cW6E30/btsO681uyOE/X126hs9ixs56UiDD4SnXM0/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/cW6E30/btsO681uyOE/X126hs9ixs56UiDD4SnXM0/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FcW6E30%2FbtsO681uyOE%2FX126hs9ixs56UiDD4SnXM0%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;125&quot; height=&quot;28&quot; data-origin-width=&quot;434&quot; data-origin-height=&quot;97&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
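&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;For a diagonal Gaussian, the KL divergence to N(0, I) has a well-known closed form, sketched below. The log-variance parameterisation and the function name are assumptions chosen for illustration; how the paper parameterises and weights this term is not covered here.&lt;/span&gt;&lt;/p&gt;

```python
import math


def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ) for one latent vector.

    Closed form: 0.5 * sum(mu^2 + sigma^2 - 1 - log(sigma^2)).
    """
    return 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv
                     for m, lv in zip(mu, log_var))


# Already standard normal: no penalty.
kl_zero = kl_to_standard_normal([0.0, 0.0], [0.0, 0.0])
# Mean shifted by 1 in one dimension: penalty of 0.5.
kl_shifted = kl_to_standard_normal([1.0], [0.0])
```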
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;The temporal smoothness loss makes the generated video look physically natural. &lt;/span&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;It prevents jitter, where motion between frames suddenly jumps or stutters. &lt;/span&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;It pushes the motion (velocity) from (t-1) to (t) and the motion (velocity) from (t-2) to (t-1) to be similar to each other.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;1002&quot; data-origin-height=&quot;155&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/w1R8T/btsO0UElafJ/Di3Juo4KCMPr3chScGFM41/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/w1R8T/btsO0UElafJ/Di3Juo4KCMPr3chScGFM41/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/w1R8T/btsO0UElafJ/Di3Juo4KCMPr3chScGFM41/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2Fw1R8T%2FbtsO0UElafJ%2FDi3Juo4KCMPr3chScGFM41%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;317&quot; height=&quot;49&quot; data-origin-width=&quot;1002&quot; data-origin-height=&quot;155&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
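&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;The "match consecutive velocities" idea can be sketched as below. The absolute-difference penalty, the averaging over frames and dimensions, and the function name are assumptions for illustration; the exact norm in the paper is the one in the equation above.&lt;/span&gt;&lt;/p&gt;

```python
def temporal_smoothness_loss(frames):
    """Penalise changes in frame-to-frame velocity.

    frames is a list of equally long feature vectors. For each t the
    velocity (x_t - x_{t-1}) is compared with the previous velocity
    (x_{t-1} - x_{t-2}); the mean absolute difference is returned.
    """
    if 3 > len(frames):
        return 0.0  # fewer than three frames: no pair of velocities
    total = 0.0
    count = 0
    for t in range(2, len(frames)):
        for cur, prev, prev2 in zip(frames[t], frames[t - 1], frames[t - 2]):
            velocity_now = cur - prev
            velocity_before = prev - prev2
            total += abs(velocity_now - velocity_before)
            count += 1
    return total / count


# Constant velocity is perfectly smooth; a back-and-forth jump is not.
smooth = temporal_smoothness_loss([[0.0], [1.0], [2.0], [3.0]])
jittery = temporal_smoothness_loss([[0.0], [1.0], [0.0]])
```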
&lt;hr contenteditable=&quot;false&quot; data-ke-type=&quot;horizontalRule&quot; data-ke-style=&quot;style6&quot; /&gt;
&lt;h2 style=&quot;text-align: justify;&quot; data-ke-size=&quot;size26&quot;&gt;&lt;span style=&quot;color: #000000; font-family: 'Noto Serif KR';&quot;&gt;5&amp;nbsp;EXPERIMENTS&lt;/span&gt;&lt;/h2&gt;
&lt;p style=&quot;text-align: justify;&quot; data-end=&quot;82&quot; data-start=&quot;51&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;color: #000000; font-family: 'Noto Serif KR';&quot;&gt;This section describes the experiments in the following order.&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;color: #000000; font-family: 'Noto Serif KR';&quot;&gt;5.1&amp;nbsp;Description of the datasets used&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;color: #000000; font-family: 'Noto Serif KR';&quot;&gt;5.2 All experimental settings&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;color: #000000; font-family: 'Noto Serif KR';&quot;&gt;5.3, 5.4 Qualitative and quantitative comparison &lt;/span&gt;&lt;span style=&quot;color: #000000; font-family: 'Noto Serif KR';&quot;&gt;between ReactFace and several baseline models&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;color: #000000; font-family: 'Noto Serif KR';&quot;&gt;5.5, 5.7, 5.9 &lt;/span&gt;&lt;span style=&quot;color: #000000; font-family: 'Noto Serif KR';&quot;&gt;Discussion of the ablation study, perceptual study, and failure cases&lt;/span&gt;&lt;/p&gt;
&lt;h3 style=&quot;text-align: justify;&quot; data-ke-size=&quot;size23&quot;&gt;&lt;span style=&quot;color: #000000; font-family: 'Noto Serif KR';&quot;&gt;5.1&amp;nbsp;Datasets&lt;/span&gt;&lt;/h3&gt;
&lt;p style=&quot;text-align: justify;&quot; data-end=&quot;629&quot; data-start=&quot;471&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;color: #000000; font-family: 'Noto Serif KR';&quot;&gt;ReactFace is evaluated on the hybrid video-conference dataset provided by the REACT2023 Challenge.&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-end=&quot;629&quot; data-start=&quot;471&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;color: #000000; font-family: 'Noto Serif KR';&quot;&gt;The dataset contains 2,962 dyadic interaction sessions in total, &lt;/span&gt;&lt;span style=&quot;color: #000000; font-family: 'Noto Serif KR';&quot;&gt;split as follows.&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;color: #000000; font-family: 'Noto Serif KR';&quot;&gt;Training examples: 1,594&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;color: #000000; font-family: 'Noto Serif KR';&quot;&gt;Validation examples: 562&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;color: #000000; font-family: 'Noto Serif KR';&quot;&gt;Test examples: 806&lt;/span&gt;&lt;/p&gt;
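&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;color: #000000; font-family: 'Noto Serif KR';&quot;&gt;As a quick sanity check on the numbers above, the short Python sketch below adds up the three splits and prints their shares. The dictionary keys are illustrative names only; the counts are the ones listed in this post. The split works out to roughly 54% / 19% / 27%.&lt;/span&gt;&lt;/p&gt;

```python
# Split sizes as listed above for the REACT2023-based dataset.
# The key names ("train", "val", "test") are illustrative only.
splits = {"train": 1594, "val": 562, "test": 806}

# Total number of dyadic interaction sessions.
total = sum(splits.values())

# Share of each split in percent, rounded to one decimal place.
shares = {name: round(100 * count / total, 1) for name, count in splits.items()}

print(total)
print(shares)
```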
&lt;p style=&quot;text-align: justify;&quot; data-end=&quot;723&quot; data-start=&quot;684&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;color: #000000; font-family: 'Noto Serif KR';&quot;&gt;The data was collected from two existing video-conference datasets:&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;color: #000000; font-family: 'Noto Serif KR';&quot;&gt;RECOLA&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;color: #000000; font-family: 'Noto Serif KR';&quot;&gt;NOXI&amp;nbsp;&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000; text-align: justify;&quot;&gt;UDIVA&lt;span&gt; (not used)&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;color: #000000;&quot;&gt; &lt;span style=&quot;font-family: 'Noto Serif KR'; text-align: start;&quot;&gt;⬇️ &lt;span style=&quot;font-family: 'Noto Serif KR'; text-align: start;&quot;&gt;RECOLA &lt;/span&gt;&lt;/span&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; text-align: start;&quot;&gt;⬇️&lt;/span&gt; &lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;&lt;a style=&quot;color: #000000;&quot; href=&quot;https://ieeexplore.ieee.org/document/6553805&quot; target=&quot;_blank&quot; rel=&quot;noopener&amp;nbsp;noreferrer&quot;&gt;https://ieeexplore.ieee.org/document/6553805&lt;/a&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;color: #000000;&quot;&gt; &lt;span style=&quot;font-family: 'Noto Serif KR'; text-align: start;&quot;&gt;⬇️ &lt;span style=&quot;font-family: 'Noto Serif KR'; text-align: start;&quot;&gt;NOXI &lt;/span&gt;&lt;/span&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; text-align: start;&quot;&gt;⬇️&lt;/span&gt; &lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;&lt;a style=&quot;color: #000000;&quot; href=&quot;https://dl.acm.org/doi/10.1145/3136755.3136780&quot; target=&quot;_blank&quot; rel=&quot;noopener&amp;nbsp;noreferrer&quot;&gt;https://dl.acm.org/doi/10.1145/3136755.3136780&lt;/a&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-end=&quot;833&quot; data-start=&quot;762&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;color: #000000; font-family: 'Noto Serif KR';&quot;&gt;Each session is 30 seconds long and consists of a pair of audio-visual clips capturing the interaction between two participants.&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-end=&quot;933&quot; data-start=&quot;835&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;color: #000000; font-family: 'Noto Serif KR';&quot;&gt;In addition, the objective appropriate facial reaction annotations &lt;/span&gt;&lt;span style=&quot;color: #000000; font-family: 'Noto Serif KR';&quot;&gt;were generated automatically using the strategy proposed in [22].&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-end=&quot;994&quot; data-start=&quot;935&quot; data-ke-size=&quot;size14&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-end=&quot;994&quot; data-start=&quot;935&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; text-align: start;&quot;&gt;⬇️&lt;span&gt; &lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000; text-align: justify;&quot;&gt;REACT2023 Challenge &lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; text-align: start;&quot;&gt;⬇️&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-end=&quot;994&quot; data-start=&quot;935&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;color: #000000; font-family: 'Noto Serif KR';&quot;&gt;&lt;a style=&quot;color: #000000;&quot; href=&quot;https://dl.acm.org/doi/10.1145/3581783.3612832&quot; target=&quot;_blank&quot; rel=&quot;noopener&amp;nbsp;noreferrer&quot;&gt;https://dl.acm.org/doi/10.1145/3581783.3612832&lt;/a&gt;&lt;/span&gt;&lt;/p&gt;
&lt;h3 style=&quot;text-align: justify;&quot; data-ke-size=&quot;size23&quot;&gt;&lt;span style=&quot;color: #000000; font-family: 'Noto Serif KR';&quot;&gt;5.2&amp;nbsp;Experimental&amp;nbsp;Setup&lt;/span&gt;&lt;/h3&gt;
&lt;h4 style=&quot;text-align: justify;&quot; data-end=&quot;106&quot; data-start=&quot;63&quot; data-ke-size=&quot;size20&quot;&gt;&lt;span style=&quot;color: #000000; font-family: 'Noto Serif KR';&quot;&gt;5.2.1 Implementation Details&lt;/span&gt;&lt;/h4&gt;
&lt;p style=&quot;text-align: justify;&quot; data-end=&quot;154&quot; data-start=&quot;108&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;color: #000000; font-family: 'Noto Serif KR';&quot;&gt;The ReactFace model was trained with the following settings:&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;color: #000000; font-family: 'Noto Serif KR';&quot;&gt; &lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000; text-align: start;&quot;&gt;✔️&lt;/span&gt;&lt;b&gt;Input speaker image&amp;nbsp;sequence:&lt;/b&gt; 224 &amp;times; 224&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;color: #000000; font-family: 'Noto Serif KR';&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000; text-align: start;&quot;&gt; &lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000; text-align: start;&quot;&gt;✔️&lt;/span&gt;&lt;b&gt;Optimizer&lt;/b&gt;&lt;/span&gt;&lt;b&gt;:&lt;/b&gt; AdamW&amp;nbsp;&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;color: #000000; font-family: 'Noto Serif KR';&quot;&gt; &lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000; text-align: start;&quot;&gt;✔️&lt;/span&gt;&lt;b&gt;Learning rate:&lt;/b&gt; 2e&amp;minus;5, &lt;/span&gt;&lt;span style=&quot;color: #000000; font-family: 'Noto Serif KR';&quot;&gt;&amp;beta;₁ = 0.9, &amp;beta;₂ = 0.999&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;color: #000000; font-family: 'Noto Serif KR';&quot;&gt; &lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000; text-align: start;&quot;&gt;✔️&lt;/span&gt;&lt;b&gt;Minibatch size:&lt;/b&gt; 4&lt;/span&gt;&lt;span style=&quot;color: #000000; font-family: 'Noto Serif KR';&quot;&gt;&lt;/span&gt;&lt;/p&gt;
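&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;color: #000000; font-family: 'Noto Serif KR';&quot;&gt;To make the optimizer settings concrete, here is a minimal pure-Python sketch of a single AdamW update on one scalar parameter, using the learning rate and &amp;beta; values listed above. The weight decay and epsilon values are assumptions (they are not stated in this post), and &lt;b&gt;adamw_step&lt;/b&gt; is a hypothetical helper written for illustration, not the authors' training code.&lt;/span&gt;&lt;/p&gt;

```python
import math

# Values taken from the settings listed above.
LEARNING_RATE = 2e-5
BETA1, BETA2 = 0.9, 0.999

# Assumed values, not stated in the post.
WEIGHT_DECAY = 0.01
EPS = 1e-8


def adamw_step(param, grad, m, v, step,
               lr=LEARNING_RATE, beta1=BETA1, beta2=BETA2,
               weight_decay=WEIGHT_DECAY, eps=EPS):
    """Apply one AdamW update to a scalar parameter.

    `step` is the 1-based update count. Returns the new parameter
    together with the updated first and second moment estimates.
    """
    # Decoupled weight decay: shrink the parameter directly.
    param = param * (1.0 - lr * weight_decay)
    # Exponential moving averages of the gradient and its square.
    m = beta1 * m + (1.0 - beta1) * grad
    v = beta2 * v + (1.0 - beta2) * grad * grad
    # Bias correction for the zero-initialised moments.
    m_hat = m / (1.0 - beta1 ** step)
    v_hat = v / (1.0 - beta2 ** step)
    # Adam step scaled by the learning rate.
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v


# First update from param=1.0 with gradient 0.5 and no weight decay.
p, m, v = adamw_step(param=1.0, grad=0.5, m=0.0, v=0.0, step=1, weight_decay=0.0)
print(p)
```

&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;color: #000000; font-family: 'Noto Serif KR';&quot;&gt;On the very first step the bias-corrected ratio is close to the sign of the gradient, so the parameter moves by about one learning rate (2e&amp;minus;5) regardless of the gradient's magnitude.&lt;/span&gt;&lt;/p&gt;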
&lt;p style=&quot;text-align: justify;&quot; data-end=&quot;368&quot; data-start=&quot;315&quot; data-ke-size=&quot;size14&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-end=&quot;368&quot; data-start=&quot;315&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;color: #000000; font-family: 'Noto Serif KR';&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000; text-align: start;&quot;&gt;✔️&lt;/span&gt;&lt;b&gt;Loss Term Balancing Hyper-parameters&lt;/b&gt;&lt;/span&gt;&lt;b&gt;&lt;span style=&quot;color: #000000; font-family: 'Noto Serif KR';&quot;&gt; (hyper-parameter settings for the loss terms in Section 4.4.1)&lt;/span&gt;&lt;/b&gt;&lt;/p&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignLeft&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;217&quot; data-origin-height=&quot;52&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/otRml/btsO1TLuSPj/bylkescLDQwnMlkoozzkOK/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/otRml/btsO1TLuSPj/bylkescLDQwnMlkoozzkOK/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/otRml/btsO1TLuSPj/bylkescLDQwnMlkoozzkOK/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FotRml%2FbtsO1TLuSPj%2FbylkescLDQwnMlkoozzkOK%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;79&quot; height=&quot;52&quot; data-origin-width=&quot;217&quot; data-origin-height=&quot;52&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;figure class=&quot;imageblock alignLeft&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;204&quot; data-origin-height=&quot;48&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/OPNcP/btsO20o4qbm/KxrPa7pxCCHG6GRsEIJEJK/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/OPNcP/btsO20o4qbm/KxrPa7pxCCHG6GRsEIJEJK/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/OPNcP/btsO20o4qbm/KxrPa7pxCCHG6GRsEIJEJK/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FOPNcP%2FbtsO20o4qbm%2FKxrPa7pxCCHG6GRsEIJEJK%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;73&quot; height=&quot;17&quot; data-origin-width=&quot;204&quot; data-origin-height=&quot;48&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;figure class=&quot;imageblock alignLeft&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;215&quot; data-origin-height=&quot;49&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/y9P6b/btsO1Tku7Ir/fcJBoo2JLjKi9K42PEEJY1/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/y9P6b/btsO1Tku7Ir/fcJBoo2JLjKi9K42PEEJY1/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/y9P6b/btsO1Tku7Ir/fcJBoo2JLjKi9K42PEEJY1/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2Fy9P6b%2FbtsO1Tku7Ir%2FfcJBoo2JLjKi9K42PEEJY1%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;88&quot; height=&quot;49&quot; data-origin-width=&quot;215&quot; data-origin-height=&quot;49&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-end=&quot;518&quot; data-start=&quot;456&quot; data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-end=&quot;518&quot; data-start=&quot;456&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;color: #000000; font-family: 'Noto Serif KR';&quot;&gt; &lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000; text-align: start;&quot;&gt;✔️&lt;/span&gt;In addition, the &lt;b&gt;momentum&amp;nbsp;parameter '&amp;alpha;'&lt;/b&gt; used in Section 4.2 was empirically set to 0.999.&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-end=&quot;518&quot; data-start=&quot;456&quot; data-ke-size=&quot;size14&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;color: #000000; font-family: 'Noto Serif KR';&quot;&gt;&lt;b&gt; &lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000; text-align: start;&quot;&gt;✔️&lt;/span&gt;Implementation:&lt;/b&gt; PyTorch&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;color: #000000; font-family: 'Noto Serif KR';&quot;&gt;&lt;b&gt; &lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000; text-align: start;&quot;&gt;✔️&lt;/span&gt;GPU:&lt;/b&gt; Tesla A100 80GB, &lt;/span&gt;&lt;span style=&quot;color: #000000; font-family: 'Noto Serif KR';&quot;&gt;trained for 200 epochs&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-end=&quot;213&quot; data-start=&quot;26&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;color: #000000; font-family: 'Noto Serif KR';&quot;&gt;&lt;b&gt; &lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000; text-align: start;&quot;&gt;✔️&lt;/span&gt;3DMM (3D Morphable Model):&lt;/b&gt; FaceVerse&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-end=&quot;213&quot; data-start=&quot;26&quot; data-ke-size=&quot;size14&quot;&gt;&lt;b&gt;&lt;span style=&quot;color: #000000; font-family: 'Noto Serif KR';&quot;&gt; &lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000; text-align: start;&quot;&gt;✔️&lt;/span&gt;FaceVerse coefficient definitions:&lt;/span&gt;&lt;/b&gt;&lt;span style=&quot;color: #000000; font-family: 'Noto Serif KR';&quot;&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-end=&quot;213&quot; data-start=&quot;26&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;color: #000000;&quot;&gt;&lt;u&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;1) Expression coefficients:&lt;/span&gt;&lt;/u&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignLeft&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;377&quot; data-origin-height=&quot;67&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/UxTfX/btsO2GYElQZ/66QOseSk4miebkD83CWG40/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/UxTfX/btsO2GYElQZ/66QOseSk4miebkD83CWG40/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/UxTfX/btsO2GYElQZ/66QOseSk4miebkD83CWG40/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FUxTfX%2FbtsO2GYElQZ%2F66QOseSk4miebkD83CWG40%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;124&quot; height=&quot;22&quot; data-origin-width=&quot;377&quot; data-origin-height=&quot;67&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;u&gt;&lt;span style=&quot;color: #000000; font-family: 'Noto Serif KR';&quot;&gt;2) Pose coefficients:&lt;/span&gt;&lt;/u&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;color: #000000; font-family: 'Noto Serif KR';&quot;&gt;&amp;theta; (3-dimensional translation and 3-dimensional rotation)&lt;/span&gt;&lt;br /&gt;&lt;span style=&quot;color: #000000; font-family: 'Noto Serif KR';&quot;&gt;&amp;rarr; That is, each face frame carries 58&amp;nbsp;coefficients in total.&lt;/span&gt;&lt;/p&gt;
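&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;color: #000000; font-family: 'Noto Serif KR';&quot;&gt;The per-frame layout can be sketched in Python as follows. This is an illustration under assumptions: the expression size of 52 is inferred as 58 minus the 6 pose values (it also matches the number of ARKit blendshapes), and the ordering (expression, then translation, then rotation) as well as the names &lt;b&gt;FrameCoefficients&lt;/b&gt; and &lt;b&gt;split_frame&lt;/b&gt; are hypothetical, not taken from the paper or from FaceVerse.&lt;/span&gt;&lt;/p&gt;

```python
from dataclasses import dataclass
from typing import List

NUM_EXPRESSION = 52    # assumed: 58 total minus 6 pose values
NUM_TRANSLATION = 3    # 3-dimensional translation
NUM_ROTATION = 3       # 3-dimensional rotation
NUM_TOTAL = NUM_EXPRESSION + NUM_TRANSLATION + NUM_ROTATION


@dataclass
class FrameCoefficients:
    """Coefficients describing one face frame."""
    expression: List[float]
    translation: List[float]
    rotation: List[float]


def split_frame(coeffs):
    """Split a flat 58-value vector into expression and pose parts.

    The ordering used here is an assumption made for illustration.
    """
    if len(coeffs) != NUM_TOTAL:
        raise ValueError(f"expected {NUM_TOTAL} coefficients, got {len(coeffs)}")
    pose_start = NUM_EXPRESSION
    rot_start = NUM_EXPRESSION + NUM_TRANSLATION
    return FrameCoefficients(
        expression=list(coeffs[:pose_start]),
        translation=list(coeffs[pose_start:rot_start]),
        rotation=list(coeffs[rot_start:]),
    )


# Example: a dummy frame whose values are simply their own indices.
frame = split_frame([float(i) for i in range(NUM_TOTAL)])
print(len(frame.expression), frame.translation, frame.rotation)
```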
&lt;p style=&quot;text-align: justify;&quot; data-end=&quot;352&quot; data-start=&quot;329&quot; data-ke-size=&quot;size14&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;blockquote data-end=&quot;352&quot; data-start=&quot;329&quot; data-ke-style=&quot;style2&quot;&gt;&lt;span style=&quot;color: #333333;&quot;&gt;&lt;b&gt;Why the FaceVerse model was used&lt;/b&gt;&lt;/span&gt;&lt;/blockquote&gt;
&lt;p style=&quot;text-align: justify;&quot; data-end=&quot;462&quot; data-start=&quot;354&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;color: #000000; font-family: 'Noto Serif KR';&quot;&gt;FaceVerse's expression coefficients map one-to-one to ARKit blendshapes&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-end=&quot;462&quot; data-start=&quot;354&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;color: #000000; font-family: 'Noto Serif KR';&quot;&gt;and are defined as elements with clear, human-interpretable meanings, such as:&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-end=&quot;462&quot; data-start=&quot;354&quot; data-ke-size=&quot;size14&quot;&gt;&lt;b&gt;&lt;span style=&quot;color: #000000; font-family: 'Noto Serif KR';&quot;&gt;browInnerUp&lt;/span&gt;&lt;/b&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-end=&quot;462&quot; data-start=&quot;354&quot; data-ke-size=&quot;size14&quot;&gt;&lt;b&gt;&lt;span style=&quot;color: #000000; font-family: 'Noto Serif KR';&quot;&gt;eyeLookDownRight&lt;/span&gt;&lt;/b&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-end=&quot;462&quot; data-start=&quot;354&quot; data-ke-size=&quot;size14&quot;&gt;&lt;b&gt;&lt;span style=&quot;color: #000000; font-family: 'Noto Serif KR';&quot;&gt;jawOpen&lt;/span&gt;&lt;/b&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-end=&quot;462&quot; data-start=&quot;354&quot; data-ke-size=&quot;size14&quot;&gt;&lt;b&gt;&lt;span style=&quot;color: #000000; font-family: 'Noto Serif KR';&quot;&gt;mouthFunnel&lt;/span&gt;&lt;/b&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-end=&quot;462&quot; data-start=&quot;354&quot; data-ke-size=&quot;size14&quot;&gt;&lt;b&gt;&lt;span style=&quot;color: #000000; font-family: 'Noto Serif KR';&quot;&gt;noseSneerRight, tongueOut, etc.&lt;/span&gt;&lt;/b&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-end=&quot;662&quot; data-start=&quot;562&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;color: #000000; font-family: 'Noto Serif KR';&quot;&gt;This contrasts with the PCA-based blendshapes used by conventional 3DMM systems. &lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-end=&quot;662&quot; data-start=&quot;562&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;color: #000000; font-family: 'Noto Serif KR';&quot;&gt;PCA-based approaches deform the face along principal-component axes, which makes them hard to interpret intuitively.&lt;/span&gt;&lt;span style=&quot;color: #000000; font-family: 'Noto Serif KR';&quot;&gt;&lt;/span&gt;&lt;/p&gt;
&lt;blockquote data-end=&quot;695&quot; data-start=&quot;669&quot; data-ke-style=&quot;style2&quot;&gt;&lt;span style=&quot;color: #333333;&quot;&gt;&lt;b&gt;Advantages of the ARKit-based FaceVerse&lt;/b&gt;&lt;/span&gt;&lt;/blockquote&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;color: #000000; font-family: 'Noto Serif KR';&quot;&gt;It can represent micro-expressions of the facial muscles more precisely.&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;color: #000000; font-family: 'Noto Serif KR';&quot;&gt;As a result, facial expressions can be depicted more accurately and realistically.&lt;/span&gt;&lt;/p&gt;
&lt;h4 style=&quot;text-align: justify;&quot; data-ke-size=&quot;size20&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;5.2.2&amp;nbsp;Baselines&lt;/span&gt;&lt;/h4&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;985&quot; data-origin-height=&quot;308&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/cD4fb2/btsO2y7C66m/bUjcegJI8D9IEsatsTBOjK/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/cD4fb2/btsO2y7C66m/bUjcegJI8D9IEsatsTBOjK/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/cD4fb2/btsO2y7C66m/bUjcegJI8D9IEsatsTBOjK/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FcD4fb2%2FbtsO2y7C66m%2FbUjcegJI8D9IEsatsTBOjK%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;985&quot; height=&quot;308&quot; data-origin-width=&quot;985&quot; data-origin-height=&quot;308&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-end=&quot;289&quot; data-start=&quot;26&quot; data-ke-size=&quot;size14&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;color: #000000;&quot;&gt;&lt;u&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; text-align: justify;&quot;&gt;1) Five baseline models&lt;/span&gt; &lt;/u&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;color: #000000;&quot;&gt;&lt;b&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;Mirror:&lt;/span&gt;&lt;/b&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt; &lt;/span&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;generates the reaction by directly copying the speaker's facial movements.&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;color: #000000;&quot;&gt;&lt;b&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;Random:&lt;/span&gt;&lt;/b&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt; &lt;/span&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;samples facial reactions at random from a Gaussian distribution.&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;color: #000000;&quot;&gt;&lt;b&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;NN motion:&lt;/span&gt;&lt;/b&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt; &lt;/span&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;retrieves the motion sequence that is the nearest neighbor of the current speaker's motion sequence, then returns the corresponding listener reaction sequence.&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;color: #000000;&quot;&gt;&lt;b&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;NN audio:&lt;/span&gt;&lt;/b&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt; &lt;/span&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;performs the nearest-neighbor search on the speaker's audio signal instead.&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;color: #000000;&quot;&gt;&lt;b&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;Trans-AE:&lt;/span&gt;&lt;/b&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt; &lt;/span&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;a transformer-based autoencoder that &lt;/span&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;uses the same speaker behaviour encoder and alignment module as ReactFace but &lt;/span&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;outputs the reaction sequence through a simple decoder.&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
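&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;color: #000000; font-family: 'Noto Serif KR';&quot;&gt;The three simplest baselines above (Mirror, Random, NN motion) can be sketched in plain Python as below. This is only an illustration of the ideas, not the paper's implementation: sequences are modelled as lists of per-frame coefficient lists, the Gaussian is assumed to be standard normal, the distance is assumed to be Euclidean, and all function names are hypothetical.&lt;/span&gt;&lt;/p&gt;

```python
import math
import random


def mirror_baseline(speaker_seq):
    """Mirror: return a copy of the speaker's own motion as the reaction."""
    return [list(frame) for frame in speaker_seq]


def random_baseline(num_frames, num_coeffs, seed=None):
    """Random: draw every coefficient from a standard Gaussian (assumed)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(num_coeffs)]
            for _ in range(num_frames)]


def sequence_distance(seq_a, seq_b):
    """Euclidean distance between two equal-shaped sequences (assumed metric)."""
    return math.sqrt(sum((a - b) ** 2
                         for frame_a, frame_b in zip(seq_a, seq_b)
                         for a, b in zip(frame_a, frame_b)))


def nn_motion_baseline(speaker_seq, train_pairs):
    """NN motion: find the closest training speaker sequence and
    return the listener reaction that was paired with it.

    `train_pairs` is a list of (speaker_sequence, listener_sequence) tuples.
    """
    _, best_listener = min(
        train_pairs,
        key=lambda pair: sequence_distance(speaker_seq, pair[0]),
    )
    return best_listener


# Toy data: two one-frame training pairs with two coefficients each.
train_pairs = [
    ([[0.0, 0.0]], [[1.0, 1.0]]),
    ([[5.0, 5.0]], [[2.0, 2.0]]),
]
query = [[4.0, 4.5]]
print(mirror_baseline(query))
print(nn_motion_baseline(query, train_pairs))
```

&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;color: #000000; font-family: 'Noto Serif KR';&quot;&gt;NN audio would follow the same retrieval pattern, with the distance computed on audio features rather than on motion coefficients.&lt;/span&gt;&lt;/p&gt;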
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-end=&quot;781&quot; data-start=&quot;758&quot; data-ke-size=&quot;size14&quot;&gt;&lt;u&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;2) Existing state-of-the-art facial reaction generation methods&lt;/span&gt;&lt;/u&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;color: #000000;&quot;&gt;&lt;b&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;LFT: &lt;/span&gt;&lt;/b&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;a flow-based generative model that has shown strong performance in generating listener head motion.&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;color: #000000; font-family: 'Noto Serif KR';&quot;&gt;&lt;b&gt;Ng et al.:&lt;/b&gt; a VQ-VAE-based model that takes the speaker's 3DMM coefficients and audio signal as input and uses a motion-audio cross-attention transformer.&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;color: #000000; font-family: 'Noto Serif KR';&quot;&gt;&lt;b&gt;Zhou et al.:&lt;/b&gt; an LSTM-based sequence-to-sequence model that takes the speaker's 3DMM coefficients, various audio features, and the listener's initial first-frame 3DMM values as input and outputs the listener's 3DMM sequence.&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;color: #000000; font-family: 'Noto Serif KR';&quot;&gt; &lt;span style=&quot;text-align: start;&quot;&gt;⬇️ &lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000; text-align: justify;&quot;&gt;LFT&lt;/span&gt; &lt;/span&gt;&lt;span style=&quot;text-align: start;&quot;&gt;paper (arXiv link) ⬇️&lt;/span&gt; &lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;color: #000000; font-family: 'Noto Serif KR';&quot;&gt;&lt;a style=&quot;color: #000000;&quot; href=&quot;https://arxiv.org/abs/2006.09888&quot; target=&quot;_blank&quot; rel=&quot;noopener&amp;nbsp;noreferrer&quot;&gt;https://arxiv.org/abs/2006.09888&lt;/a&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;color: #000000; font-family: 'Noto Serif KR';&quot;&gt; &lt;span style=&quot;text-align: start;&quot;&gt;⬇️ &lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000; text-align: justify;&quot;&gt;Ng et al.&lt;/span&gt; &lt;/span&gt;&lt;span style=&quot;text-align: start;&quot;&gt;paper (arXiv link) ⬇️&lt;/span&gt; &lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;color: #000000; font-family: 'Noto Serif KR';&quot;&gt;&lt;a style=&quot;color: #000000;&quot; href=&quot;https://arxiv.org/abs/2204.08451&quot; target=&quot;_blank&quot; rel=&quot;noopener&amp;nbsp;noreferrer&quot;&gt;https://arxiv.org/abs/2204.08451&lt;/a&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;color: #000000; font-family: 'Noto Serif KR';&quot;&gt; &lt;span style=&quot;text-align: start;&quot;&gt;⬇️ &lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000; text-align: justify;&quot;&gt;Zhou et al.&lt;span&gt; &lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&quot;text-align: start;&quot;&gt;paper (arXiv link) ⬇️&lt;/span&gt; &lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;color: #000000; font-family: 'Noto Serif KR';&quot;&gt;&lt;a style=&quot;color: #000000;&quot; href=&quot;https://arxiv.org/abs/2112.13548&quot; target=&quot;_blank&quot; rel=&quot;noopener&amp;nbsp;noreferrer&quot;&gt;https://arxiv.org/abs/2112.13548&lt;/a&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-end=&quot;89&quot; data-start=&quot;0&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;color: #000000;&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;To ensure a fair and comprehensive comparison, experimental results on the RLD dataset used by Zhou et al. are also reported. &lt;/span&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;For the final facial reaction generation, the same protocol was followed and the same 3D Morphable Model (3DMM) system as Zhou et al. was used.&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;h4 style=&quot;text-align: justify;&quot; data-ke-size=&quot;size20&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;5.2.3&amp;nbsp;Evaluation&amp;nbsp;Metrics&lt;/span&gt;&lt;/h4&gt;
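&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;The evaluation metrics used in the ReactFace paper cover&amp;nbsp;five aspects in total: appropriateness, diversity (measured with three separate metrics), realism, synchrony, and speed.&lt;/span&gt;&lt;/p&gt;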
&lt;table style=&quot;border-collapse: collapse; width: 100%;&quot; border=&quot;1&quot; data-ke-align=&quot;alignLeft&quot;&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;&lt;b&gt;Aspect&lt;/b&gt;&lt;/span&gt;&lt;/td&gt;
&lt;td&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;&lt;b&gt;Metric&lt;/b&gt;&lt;/span&gt;&lt;/td&gt;
&lt;td&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;&lt;b&gt;What it measures&lt;/b&gt;&lt;/span&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;Appropriateness&lt;/span&gt;&lt;/td&gt;
&lt;td&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;FRCorr&lt;/span&gt;&lt;/td&gt;
&lt;td&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;Similarity to the real reactions&lt;/span&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;Diversity (1)&lt;/span&gt;&lt;/td&gt;
&lt;td&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;FRDvs&lt;/span&gt;&lt;/td&gt;
&lt;td&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;Diversity of the reactions generated across different inputs&lt;/span&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;Diversity (2)&lt;/span&gt;&lt;/td&gt;
&lt;td&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;FRDiv&lt;/span&gt;&lt;/td&gt;
&lt;td&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;Diversity of the samples generated for the same input&lt;/span&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;Diversity (3)&lt;/span&gt;&lt;/td&gt;
&lt;td&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;FRVar&lt;/span&gt;&lt;/td&gt;
&lt;td&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;Variation of the facial expression within a single clip&lt;/span&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;Realism&lt;/span&gt;&lt;/td&gt;
&lt;td&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;FVD&lt;/span&gt;&lt;/td&gt;
&lt;td&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;How realistic the generated output looks&lt;/span&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;Synchrony&lt;/span&gt;&lt;/td&gt;
&lt;td&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;TLCC&lt;/span&gt;&lt;/td&gt;
&lt;td&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;How well the reaction is temporally aligned with the speaker's behaviour&lt;/span&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;Speed&lt;/span&gt;&lt;/td&gt;
&lt;td&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;FPS&lt;/span&gt;&lt;/td&gt;
&lt;td&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;How fast reactions can be generated&lt;/span&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
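&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;To give a feel for two of the metrics in the table, the Python sketch below computes a within-clip variance (the idea behind FRVar) and a time-lagged cross-correlation (the idea behind TLCC) on toy signals. These are generic textbook formulations written for illustration; the exact definitions used by the paper and the REACT2023 Challenge may differ, and the function names are hypothetical.&lt;/span&gt;&lt;/p&gt;

```python
import math


def frame_variance(seq):
    """Average per-coefficient variance over the frames of one clip."""
    num_frames = len(seq)
    num_coeffs = len(seq[0])
    total = 0.0
    for c in range(num_coeffs):
        column = [frame[c] for frame in seq]
        mean = sum(column) / num_frames
        total += sum((x - mean) ** 2 for x in column) / num_frames
    return total / num_coeffs


def pearson(xs, ys):
    """Pearson correlation of two equal-length series (0.0 if one is constant)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def lagged_corr(speaker, listener, lag):
    """Correlation after shifting the listener by `lag` frames.

    A positive lag means the listener trails the speaker.
    Both series must have the same length.
    """
    shift = abs(lag)
    keep = len(speaker) - shift
    if lag == shift:   # non-negative lag: listener trails the speaker
        xs, ys = speaker[:keep], listener[shift:]
    else:              # negative lag: listener leads the speaker
        xs, ys = speaker[shift:], listener[:keep]
    return pearson(xs, ys)


def tlcc(speaker, listener, max_lag):
    """Return (lag, correlation) for the lag with the highest correlation.

    `max_lag` should leave at least two overlapping frames.
    """
    lags = range(-max_lag, max_lag + 1)
    return max(((lag, lagged_corr(speaker, listener, lag)) for lag in lags),
               key=lambda item: item[1])


# Toy example: the listener repeats the speaker's signal two frames later.
speaker = [0.0, 1.0, 3.0, 2.0, 5.0, 4.0, 7.0, 1.0, 0.0, 2.0]
listener = [0.0, 0.0] + speaker[:-2]
print(tlcc(speaker, listener, max_lag=3))
print(frame_variance([[0.0, 1.0], [2.0, 1.0]]))
```

&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;In the toy example the listener signal is an exact copy of the speaker signal delayed by two frames, so the best lag found is 2.&lt;/span&gt;&lt;/p&gt;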
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;716&quot; data-origin-height=&quot;94&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/xH1Xa/btsO5O4chQP/Bt0l614VTyD0cEUmYaNYW0/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/xH1Xa/btsO5O4chQP/Bt0l614VTyD0cEUmYaNYW0/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/xH1Xa/btsO5O4chQP/Bt0l614VTyD0cEUmYaNYW0/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FxH1Xa%2FbtsO5O4chQP%2FBt0l614VTyD0cEUmYaNYW0%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;427&quot; height=&quot;56&quot; data-origin-width=&quot;716&quot; data-origin-height=&quot;94&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;697&quot; data-origin-height=&quot;94&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/5ZljD/btsO43nsEtZ/Dkyg7hC0M6UF5Tj0HirQk0/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/5ZljD/btsO43nsEtZ/Dkyg7hC0M6UF5Tj0HirQk0/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/5ZljD/btsO43nsEtZ/Dkyg7hC0M6UF5Tj0HirQk0/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2F5ZljD%2FbtsO43nsEtZ%2FDkyg7hC0M6UF5Tj0HirQk0%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;363&quot; height=&quot;49&quot; data-origin-width=&quot;697&quot; data-origin-height=&quot;94&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;692&quot; data-origin-height=&quot;92&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/bkzsaw/btsO6XZ84ZK/igGDqmLfXNkQdm1K3xsrsK/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/bkzsaw/btsO6XZ84ZK/igGDqmLfXNkQdm1K3xsrsK/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/bkzsaw/btsO6XZ84ZK/igGDqmLfXNkQdm1K3xsrsK/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2Fbkzsaw%2FbtsO6XZ84ZK%2FigGDqmLfXNkQdm1K3xsrsK%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;369&quot; height=&quot;49&quot; data-origin-width=&quot;692&quot; data-origin-height=&quot;92&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;501&quot; data-origin-height=&quot;100&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/d1xzQp/btsO7hKPJFP/bv0zQjKoNzkEB80I6p9shk/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/d1xzQp/btsO7hKPJFP/bv0zQjKoNzkEB80I6p9shk/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/d1xzQp/btsO7hKPJFP/bv0zQjKoNzkEB80I6p9shk/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2Fd1xzQp%2FbtsO7hKPJFP%2Fbv0zQjKoNzkEB80I6p9shk%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;245&quot; height=&quot;49&quot; data-origin-width=&quot;501&quot; data-origin-height=&quot;100&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;1166&quot; data-origin-height=&quot;819&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/lXz1Z/btsO1HLlztD/vHTd8wem3DPEUANAaOBkl1/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/lXz1Z/btsO1HLlztD/vHTd8wem3DPEUANAaOBkl1/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/lXz1Z/btsO1HLlztD/vHTd8wem3DPEUANAaOBkl1/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FlXz1Z%2FbtsO1HLlztD%2FvHTd8wem3DPEUANAaOBkl1%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;1166&quot; height=&quot;819&quot; data-origin-width=&quot;1166&quot; data-origin-height=&quot;819&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;763&quot; data-origin-height=&quot;577&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/bPWknZ/btsO6BXjGSz/WablJjtUmhb3VWP69kAr71/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/bPWknZ/btsO6BXjGSz/WablJjtUmhb3VWP69kAr71/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/bPWknZ/btsO6BXjGSz/WablJjtUmhb3VWP69kAr71/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FbPWknZ%2FbtsO6BXjGSz%2FWablJjtUmhb3VWP69kAr71%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;469&quot; height=&quot;355&quot; data-origin-width=&quot;763&quot; data-origin-height=&quot;577&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;768&quot; data-origin-height=&quot;553&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/OG66M/btsO6LrWhyO/doqYf2sWC74y1iQ5qY9TW0/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/OG66M/btsO6LrWhyO/doqYf2sWC74y1iQ5qY9TW0/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/OG66M/btsO6LrWhyO/doqYf2sWC74y1iQ5qY9TW0/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FOG66M%2FbtsO6LrWhyO%2FdoqYf2sWC74y1iQ5qY9TW0%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;482&quot; height=&quot;347&quot; data-origin-width=&quot;768&quot; data-origin-height=&quot;553&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;767&quot; data-origin-height=&quot;626&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/bhNuIy/btsO5YZUmuZ/fLp9vVjdGX0YSbYqk48Hz1/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/bhNuIy/btsO5YZUmuZ/fLp9vVjdGX0YSbYqk48Hz1/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/bhNuIy/btsO5YZUmuZ/fLp9vVjdGX0YSbYqk48Hz1/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FbhNuIy%2FbtsO5YZUmuZ%2FfLp9vVjdGX0YSbYqk48Hz1%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;489&quot; height=&quot;399&quot; data-origin-width=&quot;767&quot; data-origin-height=&quot;626&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;730&quot; data-origin-height=&quot;321&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/bcCBMe/btsO5mUEhrc/F6RAAqDlZB3EtSvomlGrv1/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/bcCBMe/btsO5mUEhrc/F6RAAqDlZB3EtSvomlGrv1/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/bcCBMe/btsO5mUEhrc/F6RAAqDlZB3EtSvomlGrv1/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FbcCBMe%2FbtsO5mUEhrc%2FF6RAAqDlZB3EtSvomlGrv1%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;443&quot; height=&quot;195&quot; data-origin-width=&quot;730&quot; data-origin-height=&quot;321&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;</description>
      <category>Paper Review</category>
      <category>empathicai</category>
      <category>multimodal</category>
      <category>reactface</category>
      <category>vae</category>
      <author>honey-vision</author>
      <guid isPermaLink="true">https://honey-vision.tistory.com/116</guid>
      <comments>https://honey-vision.tistory.com/116#entry116comment</comments>
      <pubDate>Wed, 2 Jul 2025 02:08:40 +0900</pubDate>
    </item>
    <item>
      <title>Requesting the NoXi/RECOLA Datasets</title>
      <link>https://honey-vision.tistory.com/115</link>
      <description>&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;ReactFace: Online Multiple Appropriate Facial Reaction Generation in Dyadic Interactions&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;To download the NoXi and RECOLA datasets used in this paper, fill out the request form for each dataset and send it by email.&lt;/span&gt;&lt;/p&gt;
&lt;hr contenteditable=&quot;false&quot; data-ke-type=&quot;horizontalRule&quot; data-ke-style=&quot;style6&quot; /&gt;
&lt;h2 style=&quot;color: #000000; text-align: start;&quot; data-ke-size=&quot;size26&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;NoXi&lt;/span&gt;&lt;/h2&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;Dataset homepage&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;NoXi Dataset&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;&lt;a href=&quot;https://multimediate-challenge.org/datasets/Dataset_NoXi/&quot; target=&quot;_blank&quot; rel=&quot;noopener&amp;nbsp;noreferrer&quot;&gt;https://multimediate-challenge.org/datasets/Dataset_NoXi/&lt;/a&gt;&lt;/span&gt;&lt;/p&gt;
&lt;figure id=&quot;og_1751371402666&quot; contenteditable=&quot;false&quot; data-ke-type=&quot;opengraph&quot; data-ke-align=&quot;alignCenter&quot; data-og-type=&quot;website&quot; data-og-title=&quot;MultiMediate:Multi-modal Behaviour Analysis for Artificial Mediation&quot; data-og-description=&quot;Grand Challenge at ACM MM&amp;rsquo;25&quot; data-og-host=&quot;multimediate.perceptualui.org&quot; data-og-source-url=&quot;https://multimediate-challenge.org/datasets/Dataset_NoXi/&quot; data-og-url=&quot;https://multimediate.perceptualui.org&quot; data-og-image=&quot;https://scrap.kakaocdn.net/dn/y4G4f/hyZf6sopgx/AzDekbmDCfjvKZPX6mvhYk/img.jpg?width=646&amp;amp;height=655&amp;amp;face=71_41_579_431&quot;&gt;&lt;a href=&quot;https://multimediate-challenge.org/datasets/Dataset_NoXi/&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot; data-source-url=&quot;https://multimediate-challenge.org/datasets/Dataset_NoXi/&quot;&gt;
&lt;div class=&quot;og-image&quot; style=&quot;background-image: url('https://scrap.kakaocdn.net/dn/y4G4f/hyZf6sopgx/AzDekbmDCfjvKZPX6mvhYk/img.jpg?width=646&amp;amp;height=655&amp;amp;face=71_41_579_431');&quot;&gt;&amp;nbsp;&lt;/div&gt;
&lt;div class=&quot;og-text&quot;&gt;
&lt;p class=&quot;og-title&quot; data-ke-size=&quot;size16&quot;&gt;MultiMediate:Multi-modal Behaviour Analysis for Artificial Mediation&lt;/p&gt;
&lt;p class=&quot;og-desc&quot; data-ke-size=&quot;size16&quot;&gt;Grand Challenge at ACM MM&amp;rsquo;25&lt;/p&gt;
&lt;p class=&quot;og-host&quot; data-ke-size=&quot;size16&quot;&gt;multimediate.perceptualui.org&lt;/p&gt;
&lt;/div&gt;
&lt;/a&gt;&lt;/figure&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;PDF form to fill out (EULA)&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;&lt;a href=&quot;https://multimediate-challenge.org/assets/pdf/EULA-NoXi.pdf&quot; target=&quot;_blank&quot; rel=&quot;noopener&amp;nbsp;noreferrer&quot;&gt;https://multimediate-challenge.org/assets/pdf/EULA-NoXi.pdf&lt;/a&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;hr contenteditable=&quot;false&quot; data-ke-type=&quot;horizontalRule&quot; data-ke-style=&quot;style6&quot; /&gt;
&lt;h2 data-ke-size=&quot;size26&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;RECOLA&lt;/span&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;&lt;/span&gt;&lt;/h2&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;Dataset homepage&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;RECOLA Dataset&amp;nbsp;&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;&lt;a href=&quot;https://qualinet.github.io/databases/audiovisual/recola/&quot; target=&quot;_blank&quot; rel=&quot;noopener&amp;nbsp;noreferrer&quot;&gt;https://qualinet.github.io/databases/audiovisual/recola/&lt;/a&gt;&lt;/span&gt;&lt;/p&gt;
&lt;figure id=&quot;og_1751371391225&quot; contenteditable=&quot;false&quot; data-ke-type=&quot;opengraph&quot; data-ke-align=&quot;alignCenter&quot; data-og-type=&quot;article&quot; data-og-title=&quot;RECOLA&quot; data-og-description=&quot;Subjective test databases&quot; data-og-host=&quot;qualinet.github.io&quot; data-og-source-url=&quot;https://qualinet.github.io/databases/audiovisual/recola/&quot; data-og-url=&quot;https://qualinet.github.io/databases/audiovisual/recola/&quot; data-og-image=&quot;&quot;&gt;&lt;a href=&quot;https://qualinet.github.io/databases/audiovisual/recola/&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot; data-source-url=&quot;https://qualinet.github.io/databases/audiovisual/recola/&quot;&gt;
&lt;div class=&quot;og-image&quot; style=&quot;background-image: url();&quot;&gt;&amp;nbsp;&lt;/div&gt;
&lt;div class=&quot;og-text&quot;&gt;
&lt;p class=&quot;og-title&quot; data-ke-size=&quot;size16&quot;&gt;RECOLA&lt;/p&gt;
&lt;p class=&quot;og-desc&quot; data-ke-size=&quot;size16&quot;&gt;Subjective test databases&lt;/p&gt;
&lt;p class=&quot;og-host&quot; data-ke-size=&quot;size16&quot;&gt;qualinet.github.io&lt;/p&gt;
&lt;/div&gt;
&lt;/a&gt;&lt;/figure&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;br /&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #333333; text-align: start;&quot;&gt;PDF form to fill out (EULA)&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;&lt;a href=&quot;https://diuf.unifr.ch/main/diva/recola/data/eula_recola_database.pdf&quot; target=&quot;_blank&quot; rel=&quot;noopener&amp;nbsp;noreferrer&quot;&gt;https://diuf.unifr.ch/main/diva/recola/data/eula_recola_database.pdf&lt;/a&gt;&lt;/span&gt;&lt;/p&gt;</description>
      <category>noxi</category>
      <category>reactface</category>
      <category>recola</category>
      <author>honey-vision</author>
      <guid isPermaLink="true">https://honey-vision.tistory.com/115</guid>
      <comments>https://honey-vision.tistory.com/115#entry115comment</comments>
      <pubDate>Tue, 1 Jul 2025 22:04:45 +0900</pubDate>
    </item>
    <item>
      <title>A Text-Based Empathetic Facial Expression Generation Model</title>
      <link>https://honey-vision.tistory.com/114</link>
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;Let's now go beyond recognition and build &lt;b&gt;a model that empathizes&lt;/b&gt;.&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;This experiment only combines simple off-the-shelf modules, so it does not look inside the models.&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;The result is shown below. &lt;u&gt;Given an input text, the system outputs an empathetic text response and then generates a facial expression based on that response.&lt;/u&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;625&quot; data-origin-height=&quot;455&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/cEWaVF/btsOY0irI1H/NL1ldtG0GjOPtIFITO7LJ1/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/cEWaVF/btsOY0irI1H/NL1ldtG0GjOPtIFITO7LJ1/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/cEWaVF/btsOY0irI1H/NL1ldtG0GjOPtIFITO7LJ1/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FcEWaVF%2FbtsOY0irI1H%2FNL1ldtG0GjOPtIFITO7LJ1%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;514&quot; height=&quot;374&quot; data-origin-width=&quot;625&quot; data-origin-height=&quot;455&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;628&quot; data-origin-height=&quot;456&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/d66gAn/btsOYfgmzmX/mRmzHTYJ2hcH3W6cnDmK90/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/d66gAn/btsOYfgmzmX/mRmzHTYJ2hcH3W6cnDmK90/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/d66gAn/btsOYfgmzmX/mRmzHTYJ2hcH3W6cnDmK90/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2Fd66gAn%2FbtsOYfgmzmX%2FmRmzHTYJ2hcH3W6cnDmK90%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;497&quot; height=&quot;361&quot; data-origin-width=&quot;628&quot; data-origin-height=&quot;456&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;Generating the empathetic text is the key step.&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;For this, the system uses a T5 model for empathetic text generation.&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;Stable Diffusion is used to generate the facial expression image.&lt;/span&gt;&lt;/p&gt;
&lt;pre id=&quot;code_1751267143163&quot; class=&quot;python&quot; data-ke-language=&quot;python&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;import gradio as gr
import uuid
import torch
from transformers import pipeline, AutoTokenizer, AutoModelForSeq2SeqLM
from diffusers import StableDiffusionPipeline
import os

# 1. T5 empathetic text generation model
t5_tokenizer = AutoTokenizer.from_pretrained(&quot;pixelsandpointers/t5-empatheticdialogues&quot;)
t5_model = AutoModelForSeq2SeqLM.from_pretrained(&quot;pixelsandpointers/t5-empatheticdialogues&quot;)

# 2. Emotion classifier
emotion_classifier = pipeline(&quot;text-classification&quot;, model=&quot;j-hartmann/emotion-english-distilroberta-base&quot;)

# 3. Face image generator (Stable Diffusion); use float16 only on GPU
device = &quot;cuda&quot; if torch.cuda.is_available() else &quot;cpu&quot;
sd_pipe = StableDiffusionPipeline.from_pretrained(&quot;CompVis/stable-diffusion-v1-4&quot;, torch_dtype=torch.float16 if device==&quot;cuda&quot; else torch.float32)
sd_pipe = sd_pipe.to(device)

# 4. Generate an empathetic text response
def t5_empathic_response(user_input: str):
    inputs = t5_tokenizer(user_input, return_tensors=&quot;pt&quot;, truncation=True)
    outputs = t5_model.generate(**inputs, max_length=60)
    return t5_tokenizer.decode(outputs[0], skip_special_tokens=True)

# 5. Generate a face image for the given emotion and save it to disk
def generate_emotion_face_image(emotion: str, output_path: str):
    prompt = f&quot;a human face expressing {emotion.lower()} emotion, portrait, realistic, 4k&quot;
    image = sd_pipe(prompt).images[0]
    image.save(output_path)

# 6. Full pipeline
def full_chatbot_pipeline(user_input, _):  # text-only input, no video
    response_text = t5_empathic_response(user_input)
    # Classify the emotion of the generated response, not of the user input
    emotion = emotion_classifier(response_text)[0]['label'].upper()

    if emotion == &quot;JOY&quot;:
        emotion = &quot;HAPPINESS&quot;  # renders better in the generated image

    os.makedirs(&quot;results&quot;, exist_ok=True)
    output_path = f&quot;results/face_{uuid.uuid4().hex}.png&quot;
    generate_emotion_face_image(emotion, output_path)

    result_text = (
        f&quot;User: {user_input}\n&quot;
        f&quot;Empathetic response: {response_text}\n&quot;
        f&quot;Emotion: {emotion}&quot;
    )
    return result_text, output_path

# 7. Gradio UI
with gr.Blocks() as demo:
    with gr.Row():
        text_input = gr.Textbox(label=&quot;User input&quot;, placeholder=&quot;e.g. I had a rough day today...&quot;)
    generate_btn = gr.Button(&quot;Generate empathetic response and face&quot;)

    with gr.Row():
        result_textbox = gr.Textbox(label=&quot;Empathetic text and emotion result&quot;)
        image_output = gr.Image(label=&quot;Generated face image&quot;)

    generate_btn.click(fn=full_chatbot_pipeline,
                       inputs=[text_input, gr.State(None)],
                       outputs=[result_textbox, image_output])

demo.launch()&lt;/code&gt;&lt;/pre&gt;
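&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;The label handling in step 6 can also be pulled out into a small lookup, which makes it easy to add more substitutions later. This is only a sketch: PROMPT_WORDS and build_face_prompt are hypothetical names, and the only substitution taken from the code above is JOY to HAPPINESS.&lt;/span&gt;&lt;/p&gt;

```python
# Hypothetical mapping from classifier labels to prompt-friendly words.
# Only the JOY to HAPPINESS substitution comes from the pipeline above.
PROMPT_WORDS = {
    'JOY': 'HAPPINESS',
}


def build_face_prompt(label):
    # Normalise the label, then fall back to the label itself when no
    # substitution is defined for it.
    key = label.strip().upper()
    emotion = PROMPT_WORDS.get(key, key)
    return f'a human face expressing {emotion.lower()} emotion, portrait, realistic, 4k'
```

&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;The returned string uses the same prompt template as generate_emotion_face_image, so it could be passed to the Stable Diffusion pipeline unchanged.&lt;/span&gt;&lt;/p&gt;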
&lt;p data-ke-size=&quot;size14&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;In the experiment, the randomness of the diffusion model produced some slightly grotesque images.&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;The expression matches the emotion label, but the overall image is distorted enough to be uncomfortable to look at.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;628&quot; data-origin-height=&quot;454&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/TOd0K/btsOXjKxFDV/gIPklOzd5cgWxJP9Zkfbkk/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/TOd0K/btsOXjKxFDV/gIPklOzd5cgWxJP9Zkfbkk/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/TOd0K/btsOXjKxFDV/gIPklOzd5cgWxJP9Zkfbkk/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FTOd0K%2FbtsOXjKxFDV%2FgIPklOzd5cgWxJP9Zkfbkk%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;496&quot; height=&quot;359&quot; data-origin-width=&quot;628&quot; data-origin-height=&quot;454&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;619&quot; data-origin-height=&quot;450&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/nroPX/btsOWpxmugl/jVv8VDwL3LWlW5udisonkk/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/nroPX/btsOWpxmugl/jVv8VDwL3LWlW5udisonkk/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/nroPX/btsOWpxmugl/jVv8VDwL3LWlW5udisonkk/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FnroPX%2FbtsOWpxmugl%2FjVv8VDwL3LWlW5udisonkk%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;492&quot; height=&quot;358&quot; data-origin-width=&quot;619&quot; data-origin-height=&quot;450&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;I said at the start that this would be a model that 'empathizes', but that claim is arguably a contradiction.&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;The expression is generated simply by following the emotion label of the text.&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;Empathy should emerge from a fusion of several modalities, so empathy based on text alone falls somewhat short.&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;Next, I will try to build &lt;u&gt;an empathy generation model that uses multiple modalities&lt;/u&gt; rather than text alone.&lt;/span&gt;&lt;/p&gt;</description>
      <author>honey-vision</author>
      <guid isPermaLink="true">https://honey-vision.tistory.com/114</guid>
      <comments>https://honey-vision.tistory.com/114#entry114comment</comments>
      <pubDate>Mon, 30 Jun 2025 16:07:01 +0900</pubDate>
    </item>
    <item>
      <title>Fidelity and Quality</title>
      <link>https://honey-vision.tistory.com/110</link>
      <description>&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;'Fidelity' is a term that originated in signal processing, in the context of transmitters and receivers.&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;In the Video Super-Resolution (VSR) task, too, balancing fidelity is considered important.&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;So what is fidelity?&lt;/span&gt;&lt;/b&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;u&gt;&lt;b&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;Fidelity: how faithfully the output reproduces the original.&lt;/span&gt;&lt;/b&gt;&lt;/u&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;The higher the fidelity, the closer the result is to the original.&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;In VSR, however, being close to the original means being close to the low-resolution input,&amp;nbsp;&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;u&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;which is at odds with the fundamental goal of VSR: producing high-resolution output.&lt;/span&gt;&lt;/u&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;Let's look at frames from &lt;b&gt;a high-fidelity case&lt;/b&gt;.&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;(The images are chosen for easy viewing and may not be a perfectly appropriate example.)&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;Assuming the first frame is the input, the three frames to its right all look similar to it.&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;To the human eye, the sequence looks natural rather than awkward.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;747&quot; data-origin-height=&quot;172&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/tntl2/btsOSwPOjw8/VgkKE3MaWG7Dk1KmD8eTSK/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/tntl2/btsOSwPOjw8/VgkKE3MaWG7Dk1KmD8eTSK/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/tntl2/btsOSwPOjw8/VgkKE3MaWG7Dk1KmD8eTSK/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2Ftntl2%2FbtsOSwPOjw8%2FVgkKE3MaWG7Dk1KmD8eTSK%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;747&quot; height=&quot;172&quot; data-origin-width=&quot;747&quot; data-origin-height=&quot;172&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;Now let's look at &lt;b&gt;a low-fidelity case&lt;/b&gt;.&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;In the example below, the pillar is generated differently in each frame.&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;The quality looks better than in the example above, but to the human eye the sequence feels unnatural and awkward.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;746&quot; data-origin-height=&quot;173&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/caJBb8/btsOSCJpgug/wYriJ8KGilXTudxtx6nJG1/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/caJBb8/btsOSCJpgug/wYriJ8KGilXTudxtx6nJG1/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/caJBb8/btsOSCJpgug/wYriJ8KGilXTudxtx6nJG1/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FcaJBb8%2FbtsOSCJpgug%2FwYriJ8KGilXTudxtx6nJG1%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;746&quot; height=&quot;173&quot; data-origin-width=&quot;746&quot; data-origin-height=&quot;173&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;This problem leads to poor results on sequential data such as video. In diffusion-based Video Super-Resolution, which is an active research area, a variety of methods have been proposed to address this fidelity problem.&lt;/span&gt;&lt;/p&gt;
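&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;The two cases above can be sketched numerically. The code below is an illustration under stated assumptions, not a method from any particular paper: PSNR against a reference frame stands in for fidelity, the mean absolute change between consecutive frames stands in for the frame-to-frame inconsistency seen in the pillar, and frames are simplified to flat lists of pixel intensities. The function names psnr and flicker are hypothetical.&lt;/span&gt;&lt;/p&gt;

```python
import math


def psnr(frame, reference, peak=255.0):
    # Peak signal-to-noise ratio between two frames given as flat lists
    # of pixel intensities; higher means closer to the reference.
    mse = sum((a - b) ** 2 for a, b in zip(frame, reference)) / len(frame)
    if mse == 0:
        return math.inf
    return 10.0 * math.log10(peak * peak / mse)


def flicker(frames):
    # Mean absolute change between consecutive frames. A large value in a
    # region that should be static suggests temporal inconsistency.
    diffs = []
    for prev, cur in zip(frames, frames[1:]):
        diffs.append(sum(abs(a - b) for a, b in zip(prev, cur)) / len(cur))
    return sum(diffs) / len(diffs)
```

&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;With these two numbers, the low-fidelity example would show up as a high flicker value in the pillar region even when each individual frame looks sharp.&lt;/span&gt;&lt;/p&gt;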
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;</description>
      <category>Study</category>
      <category>Fidelity</category>
      <author>honey-vision</author>
      <guid isPermaLink="true">https://honey-vision.tistory.com/110</guid>
      <comments>https://honey-vision.tistory.com/110#entry110comment</comments>
      <pubDate>Tue, 24 Jun 2025 10:23:32 +0900</pubDate>
    </item>
    <item>
      <title>[Paper Review] Upscale-A-Video</title>
      <link>https://honey-vision.tistory.com/108</link>
      <description>&lt;h2 style=&quot;text-align: justify;&quot; data-end=&quot;56&quot; data-start=&quot;39&quot; data-ke-size=&quot;size26&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;Abstract&lt;/span&gt;&lt;/h2&gt;
&lt;p style=&quot;text-align: justify;&quot; data-end=&quot;416&quot; data-start=&quot;58&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;Text-based diffusion models have shown strong results in generation and editing, but in video super-resolution (VSR) &lt;u&gt;the randomness of diffusion models makes it difficult to satisfy output fidelity and temporal consistency at the same time.&lt;/u&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-end=&quot;416&quot; data-start=&quot;58&quot; data-ke-size=&quot;size14&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-end=&quot;416&quot; data-start=&quot;58&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;color: #000000;&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; text-align: justify;&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; text-align: justify;&quot;&gt; &lt;/span&gt;&lt;/span&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; text-align: justify;&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; text-align: justify;&quot;&gt;output fidelity: the restored output must stay faithful to the input content; &lt;/span&gt;because the input is a video rather than a single image, frames must not be generated independently of one another.&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-end=&quot;416&quot; data-start=&quot;58&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000; text-align: justify;&quot;&gt; &lt;span style=&quot;font-family: 'Noto Serif KR'; text-align: justify;&quot;&gt; &lt;/span&gt;temporal consistency: consecutive frames of a video must connect smoothly.&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-end=&quot;416&quot; data-start=&quot;58&quot; data-ke-size=&quot;size14&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-end=&quot;416&quot; data-start=&quot;58&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;color: #000000;&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;To address these problems, the paper proposes Upscale-A-Video, a text-guided&lt;/span&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;&amp;nbsp;latent diffusion framework that ensures temporal consistency through two mechanisms.&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-end=&quot;578&quot; data-start=&quot;418&quot; data-ke-size=&quot;size14&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-end=&quot;578&quot; data-start=&quot;418&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;&lt;b&gt;1. Local level&lt;/b&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-end=&quot;578&quot; data-start=&quot;418&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;color: #000000;&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;&amp;nbsp; &amp;nbsp;- Temporal layers are integrated into the U-Net and the VAE-Decoder&lt;/span&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;&amp;nbsp;to maintain consistency within short sequences.&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-end=&quot;578&quot; data-start=&quot;418&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;&lt;span style=&quot;letter-spacing: 0px;&quot;&gt;&lt;b&gt;2. Global level&lt;/b&gt;&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-end=&quot;578&quot; data-start=&quot;418&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;&lt;span style=&quot;letter-spacing: 0px;&quot;&gt;&amp;nbsp; &amp;nbsp;- Without any training&lt;/span&gt;&lt;span style=&quot;letter-spacing: 0px;&quot;&gt;, a flow-guided recurrent latent propagation module is introduced to improve the stability of the whole sequence.&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-end=&quot;997&quot; data-start=&quot;835&quot; data-ke-size=&quot;size14&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-end=&quot;997&quot; data-start=&quot;835&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;In addition, the framework lets users guide texture generation through text prompts and adjust the noise level to control the balance between restoration and generation. &lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-end=&quot;997&quot; data-start=&quot;835&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;This makes it possible to trade off fidelity against quality.&lt;/span&gt;&lt;/p&gt;
&lt;hr contenteditable=&quot;false&quot; data-ke-type=&quot;horizontalRule&quot; data-ke-style=&quot;style6&quot; /&gt;
&lt;h2 style=&quot;text-align: justify;&quot; data-ke-size=&quot;size26&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;1.&amp;nbsp;Introduction&lt;/span&gt;&lt;/h2&gt;
&lt;p style=&quot;text-align: justify;&quot; data-end=&quot;139&quot; data-start=&quot;30&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;color: #000000;&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; text-align: start;&quot;&gt;&lt;u&gt;Earlier approaches to Video Super-Resolution&lt;/u&gt;: &lt;/span&gt;&lt;/span&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;these focused on synthetic degradations or &lt;/span&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;camera-related degradations, but &lt;span style=&quot;font-family: 'Noto Serif KR'; text-align: justify;&quot;&gt;they show &lt;b&gt;limitations in real-world settings&lt;/b&gt;.&lt;/span&gt; &lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-end=&quot;334&quot; data-start=&quot;141&quot; data-ke-size=&quot;size14&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-end=&quot;334&quot; data-start=&quot;141&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt; &lt;span style=&quot;text-align: justify;&quot;&gt; &lt;/span&gt;&lt;span style=&quot;text-align: justify;&quot;&gt;synthetic degradations: &lt;/span&gt;low-quality data created by artificially applying noise, blur, downsampling, and similar operations to an originally high-quality video&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-end=&quot;334&quot; data-start=&quot;141&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt; &lt;span style=&quot;text-align: justify;&quot;&gt; &lt;/span&gt;&lt;span style=&quot;text-align: justify;&quot;&gt;camera-related degradations: &lt;/span&gt;the various kinds of quality loss that occur when a video is captured by a real camera system &lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-end=&quot;574&quot; data-start=&quot;439&quot; data-ke-size=&quot;size14&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-end=&quot;794&quot; data-start=&quot;576&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;&lt;u&gt;Subsequent CNN-based methods&lt;/u&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; text-align: start;&quot;&gt;:&amp;nbsp;&lt;/span&gt;CNN-based networks succeeded in mitigating various degradations to some extent, but because of their limited generative capability they still fall short in producing textures and details, and as a result &lt;b&gt;over-smoothing&lt;/b&gt; occurs frequently.&amp;nbsp;&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-end=&quot;794&quot; data-start=&quot;576&quot; data-ke-size=&quot;size14&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-end=&quot;794&quot; data-start=&quot;576&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;The figure below shows an example of over-smoothing in RealBasicVSR.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignRight&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;1004&quot; data-origin-height=&quot;421&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/cJDZZh/btsOKTLfeXN/AmOkA2ddGokUs6AXckAb81/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/cJDZZh/btsOKTLfeXN/AmOkA2ddGokUs6AXckAb81/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/cJDZZh/btsOKTLfeXN/AmOkA2ddGokUs6AXckAb81/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FcJDZZh%2FbtsOKTLfeXN%2FAmOkA2ddGokUs6AXckAb81%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;836&quot; height=&quot;421&quot; data-origin-width=&quot;1004&quot; data-origin-height=&quot;421&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-end=&quot;369&quot; data-start=&quot;30&quot; data-ke-size=&quot;size14&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-end=&quot;369&quot; data-start=&quot;30&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;color: #000000;&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;&lt;u&gt;Later diffusion-based methods:&lt;/u&gt; these effectively alleviate the over-smoothing problem of CNN-based models,&lt;span style=&quot;color: #000000;&quot;&gt; but the randomness of diffusion causes&lt;/span&gt;&lt;span style=&quot;color: #1a5490;&quot;&gt;&lt;b&gt; temporal discontinuities and &lt;/b&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;&lt;span style=&quot;color: #1a5490;&quot;&gt;&lt;b&gt;flickering&lt;/b&gt;&lt;/span&gt;&lt;span style=&quot;color: #000000;&quot;&gt;.&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-end=&quot;369&quot; data-start=&quot;30&quot; data-ke-size=&quot;size14&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-end=&quot;369&quot; data-start=&quot;30&quot; data-ke-size=&quot;size14&quot;&gt;&lt;u&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;Strategies proposed to address &lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;&lt;span style=&quot;color: #1a5490;&quot;&gt;&lt;b&gt;temporal discontinuities and&lt;span&gt;&amp;nbsp;&lt;/span&gt;&lt;/b&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;&lt;span style=&quot;color: #1a5490;&quot;&gt;&lt;b&gt;flickering&lt;/b&gt;&lt;/span&gt;&lt;/span&gt;:&lt;/span&gt;&lt;/u&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-end=&quot;369&quot; data-start=&quot;30&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;Strategy 1. Add temporal layers such as 3D convolution and temporal attention, then fine-tune.&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-end=&quot;369&quot; data-start=&quot;30&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;Strategy 2. Apply cross-frame attention or flow-guided attention to a pretrained model in a zero-shot manner (no training).&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-end=&quot;369&quot; data-start=&quot;30&quot; data-ke-size=&quot;size14&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-end=&quot;557&quot; data-start=&quot;467&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;However, two problems remain.&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-end=&quot;557&quot; data-start=&quot;467&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;1) Because current methods operate in the U-Net or latent space, they struggle to maintain low-level consistency, and &lt;b&gt;texture flickering&lt;/b&gt; still occurs.&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-end=&quot;914&quot; data-start=&quot;744&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;2) Existing temporal layers and attention mechanisms are limited in their ability to ensure &lt;b&gt;global temporal consistency&lt;/b&gt;.&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-end=&quot;914&quot; data-start=&quot;744&quot; data-ke-size=&quot;size14&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-end=&quot;914&quot; data-start=&quot;744&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: GungSeo, serif; color: #000000;&quot;&gt; &lt;span style=&quot;font-family: 'Noto Serif KR'; text-align: justify;&quot;&gt; &lt;span style=&quot;font-family: 'Noto Serif KR'; text-align: start;&quot;&gt;low-level consistency: &lt;/span&gt;&lt;/span&gt;consistency at the pixel level, or in attributes such as color, brightness, and texture&amp;nbsp;&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-end=&quot;914&quot; data-start=&quot;744&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt; &lt;span style=&quot;font-family: 'Noto Serif KR'; text-align: justify;&quot;&gt; &lt;/span&gt;global temporal consistency: the content, style, motion, color, texture, and so on remain consistent across all frames of the video&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-end=&quot;266&quot; data-start=&quot;30&quot; data-ke-size=&quot;size14&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-end=&quot;266&quot; data-start=&quot;30&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;&lt;u&gt;Proposed method for resolving &lt;b&gt;texture flickering&lt;/b&gt; and ensuring &lt;b&gt;global temporal consistency&lt;/b&gt;:&lt;/u&gt;&lt;br /&gt;Adopt a local-global strategy in the video reconstruction process.&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-end=&quot;266&quot; data-start=&quot;30&quot; data-ke-size=&quot;size14&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot; data-start=&quot;418&quot; data-end=&quot;578&quot;&gt;&lt;span style=&quot;color: #000000;&quot;&gt;&lt;b&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;1. Local level&lt;/span&gt;&lt;/b&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot; data-start=&quot;418&quot; data-end=&quot;578&quot;&gt;&lt;span style=&quot;color: #000000;&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;- Temporal layers are added to a pre-trained &amp;times;4 image upscaling model, which is then fine-tuned on video data. &lt;/span&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;In other words, temporal layers are integrated into both the U-Net and the VAE-Decoder: &lt;/span&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;the U-Net with temporal layers provides temporal consistency, while the VAE-Decoder reduces flickering.&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot; data-start=&quot;418&quot; data-end=&quot;578&quot;&gt;&lt;span style=&quot;color: #000000;&quot;&gt;&lt;b&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;&lt;span style=&quot;letter-spacing: 0px;&quot;&gt;2. Global level&lt;/span&gt;&lt;/span&gt;&lt;/b&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot; data-start=&quot;418&quot; data-end=&quot;578&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;&lt;span style=&quot;letter-spacing: 0px;&quot;&gt; - &lt;span style=&quot;letter-spacing: 0px;&quot;&gt;Without any training&lt;/span&gt;&lt;span style=&quot;letter-spacing: 0px;&quot;&gt;, a flow-guided recurrent latent propagation module is introduced; &lt;/span&gt;it performs frame-to-frame propagation and latent fusion bidirectionally across the short video segments, which improves the stability of the whole sequence.&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot; data-start=&quot;418&quot; data-end=&quot;578&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot; data-start=&quot;418&quot; data-end=&quot;578&quot;&gt;&lt;span style=&quot;color: #000000;&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; text-align: justify;&quot;&gt; &lt;/span&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; text-align: justify;&quot;&gt;Inter-frame warping and latent fusion: based on optical flow, the latent z of the previous/next frame is warped to the spatial position of the current frame and blended in at an appropriate ratio.&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot; data-start=&quot;418&quot; data-end=&quot;578&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000; text-align: justify;&quot;&gt; &lt;span style=&quot;font-family: 'Noto Serif KR'; text-align: justify;&quot;&gt; &lt;/span&gt;Blending at an appropriate ratio: the fusion weight &lt;span aria-hidden=&quot;true&quot;&gt;&amp;beta;&lt;/span&gt; controls this ratio. A larger &lt;span aria-hidden=&quot;true&quot;&gt;&amp;beta;&lt;/span&gt; gives more weight to the warped information, and a smaller &lt;span aria-hidden=&quot;true&quot;&gt;&amp;beta;&lt;/span&gt; gives more weight to the current frame. &lt;/span&gt;&lt;/p&gt;
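&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;The warping and blending described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the names warp_nearest and fuse_latents are hypothetical, the latent is a plain 2D grid of numbers, the flow is a per-pixel (dx, dy) offset, and warping uses nearest-neighbor lookup. The actual module in the paper may differ in how it warps and in which positions it fuses.&lt;/span&gt;&lt;/p&gt;

```python
def warp_nearest(latent, flow):
    """Backward-warp a 2D grid: output[y][x] reads latent at (y + dy, x + dx)."""
    height, width = len(latent), len(latent[0])
    warped = []
    for y in range(height):
        row = []
        for x in range(width):
            dx, dy = flow[y][x]
            # Round to the nearest source pixel and clamp to the grid borders.
            src_x = min(max(round(x + dx), 0), width - 1)
            src_y = min(max(round(y + dy), 0), height - 1)
            row.append(latent[src_y][src_x])
        warped.append(row)
    return warped


def fuse_latents(current, neighbor, flow, beta):
    """Blend the warped neighbor into the current latent with fusion weight beta."""
    warped = warp_nearest(neighbor, flow)
    return [
        [beta * w + (1.0 - beta) * c for w, c in zip(w_row, c_row)]
        for w_row, c_row in zip(warped, current)
    ]


# Toy 2x3 latents (hypothetical values, not taken from the paper).
current = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
neighbor = [[0.0, 1.0, 2.0], [3.0, 4.0, 5.0]]
# Every pixel reads from one column to its right in the neighbor frame.
flow = [[(1.0, 0.0)] * 3 for _ in range(2)]

fused = fuse_latents(current, neighbor, flow, beta=0.5)
print(fused)
```

&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;With beta set to 0 the result is the current latent unchanged, and with beta set to 1 it is the warped neighbor, which matches the description of &amp;beta; above.&lt;/span&gt;&lt;/p&gt;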
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot; data-start=&quot;418&quot; data-end=&quot;578&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot; data-start=&quot;418&quot; data-end=&quot;578&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;In addition, text prompts can be used as an optional condition to generate realistic, high-quality details. Injecting noise into the input makes the model more robust, and adjusting the noise strength controls the balance between restoration and generation.&lt;/span&gt;&lt;/p&gt;
&lt;hr contenteditable=&quot;false&quot; data-ke-type=&quot;horizontalRule&quot; data-ke-style=&quot;style6&quot; /&gt;
&lt;h2 style=&quot;text-align: justify;&quot; data-ke-size=&quot;size26&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;2.&amp;nbsp;Related&amp;nbsp;Work&lt;/span&gt;&lt;/h2&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;color: #000000;&quot;&gt;&lt;b&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;Video Super-Resolution.&lt;/span&gt;&lt;/b&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-end=&quot;256&quot; data-start=&quot;139&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;color: #000000;&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;As noted in the introduction, &lt;/span&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;most existing methods &lt;/span&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;assume a pre-defined degradation process, &lt;/span&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;and their performance drops sharply because of &lt;u&gt;limited generalization in real-world settings&lt;/u&gt;.&lt;/span&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;&amp;nbsp;Recent approaches therefore start from the assumption that the input video has unknown degradations. &lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-end=&quot;256&quot; data-start=&quot;139&quot; data-ke-size=&quot;size14&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-end=&quot;256&quot; data-start=&quot;139&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt; &lt;span style=&quot;font-family: 'Noto Serif KR'; text-align: justify;&quot;&gt; &lt;span style=&quot;font-family: 'Noto Serif KR'; text-align: justify;&quot;&gt;unknown degradations: &lt;/span&gt;&lt;/span&gt;all the compound, nonlinear, and hard-to-predict sources of quality loss that arise when a video is captured by a camera, stored, or transmitted in the real world &lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-end=&quot;256&quot; data-start=&quot;139&quot; data-ke-size=&quot;size14&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-end=&quot;256&quot; data-start=&quot;139&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;color: #000000;&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;&lt;span style=&quot;color: #000000;&quot;&gt;However,&lt;/span&gt;&lt;span style=&quot;color: #953b34;&quot;&gt; &lt;u&gt;paired real-world HR-LR data remains scarce&lt;/u&gt;&lt;/span&gt;&lt;span style=&quot;color: #000000;&quot;&gt;. To address this, &lt;/span&gt;&lt;/span&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;collecting HR-LR pairs with iPhone cameras has been proposed, but that approach &lt;u&gt;generalizes poorly to other devices&lt;/u&gt;.&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-end=&quot;256&quot; data-start=&quot;139&quot; data-ke-size=&quot;size14&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-end=&quot;256&quot; data-start=&quot;139&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;color: #953b34; font-family: 'Noto Serif KR';&quot;&gt; &lt;u&gt;Approach to the shortage of paired HR-LR data:&lt;/u&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-end=&quot;256&quot; data-start=&quot;139&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;color: #000000;&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;Recent work has shifted toward applying diverse &lt;span style=&quot;font-family: 'Noto Serif KR'; text-align: start;&quot;&gt;degradations&lt;/span&gt; as data augmentation during training. &lt;span style=&quot;color: #000000;&quot;&gt;Even so, existing CNN-based approaches&lt;/span&gt; &lt;/span&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;&lt;span style=&quot;color: #1a5490;&quot;&gt;&lt;u&gt;lack a generative prior&lt;/u&gt;&lt;/span&gt;&lt;span style=&quot;color: #000000;&quot;&gt; and &lt;/span&gt;still struggle to generate photo-realistic textures.&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-end=&quot;256&quot; data-start=&quot;139&quot; data-ke-size=&quot;size14&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-end=&quot;256&quot; data-start=&quot;139&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #1a5490;&quot;&gt; &lt;u&gt;Method proposed to address the lack of a generative prior:&lt;/u&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-end=&quot;256&quot; data-start=&quot;139&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;color: #000000;&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;Accordingly, the paper focuses on exploiting &lt;/span&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;the strong, general generative prior embedded in &lt;/span&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;the Stable Diffusion (SD) &amp;times;4 upscaler, &lt;/span&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;a pretrained image diffusion model.&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-end=&quot;256&quot; data-start=&quot;139&quot; data-ke-size=&quot;size14&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt; &lt;span style=&quot;font-family: 'Noto Serif KR'; text-align: justify;&quot;&gt; &lt;/span&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; text-align: justify;&quot;&gt;generative prior: &lt;/span&gt;traditional CNN-based image restoration models (e.g., SR) are mostly trained as regressors that &lt;span style=&quot;font-family: 'Noto Serif KR'; text-align: justify;&quot;&gt;minimize the pixel-wise difference from the ground-truth image (L1 or L2 loss),&lt;/span&gt;&amp;nbsp;so they tend to predict averaged values and have difficulty generating detailed textures. A pretrained generative model, by contrast, has learned what realistic textures look like and can supply them.&lt;/span&gt;&lt;/p&gt;
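&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;The averaging effect of pixel-wise regression can be checked with a toy example. The sketch below assumes a single pixel whose sharp ground truth is equally likely to be dark (0.0) or bright (1.0); the values and the helper name mse are made up for illustration and do not come from the paper. The single prediction that minimizes the L2 loss over both possibilities is the gray value in between, which is neither of the sharp textures.&lt;/span&gt;&lt;/p&gt;

```python
from statistics import mean


def mse(prediction, targets):
    """Mean squared error of one prediction against several plausible targets."""
    return mean((prediction - t) ** 2 for t in targets)


# Hypothetical pixel: the true texture is either dark or bright, equally likely.
plausible_pixels = [0.0, 1.0]

# Search a coarse grid of candidate predictions from 0.0 to 1.0.
candidates = [i / 10 for i in range(11)]
best = min(candidates, key=lambda p: mse(p, plausible_pixels))

print(best)  # the L2-optimal prediction is the average of the plausible values
```

&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;This is a one-pixel caricature, but it shows why a regressor trained on ambiguous inputs drifts toward blurry, averaged outputs.&lt;/span&gt;&lt;/p&gt;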
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-end=&quot;256&quot; data-start=&quot;139&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;color: #000000;&quot;&gt;&lt;b&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;Diffusion&amp;nbsp;Models&amp;nbsp;for&amp;nbsp;Video&amp;nbsp;Tasks.&lt;/span&gt;&lt;/b&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-end=&quot;256&quot; data-start=&quot;139&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;For efficiency, work on video diffusion models has used off-the-shelf image diffusion models to generate video in a zero-shot manner. To maintain temporal consistency, these methods use cross-frame attention between adjacent frames or warping based on optical flow. However, their&lt;span style=&quot;color: #1a5490;&quot;&gt; &lt;u&gt;generalizability is limited&lt;/u&gt;&lt;/span&gt;. &lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-end=&quot;256&quot; data-start=&quot;139&quot; data-ke-size=&quot;size14&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-end=&quot;256&quot; data-start=&quot;139&quot; data-ke-size=&quot;size14&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-end=&quot;256&quot; data-start=&quot;139&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000; text-align: justify;&quot;&gt; &lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000; text-align: justify;&quot;&gt; &lt;/span&gt;off-the-shelf&lt;span&gt;:&lt;span&gt;&amp;nbsp;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000; text-align: justify;&quot;&gt;a model that has already been trained on a large dataset and can be used as-is&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-end=&quot;256&quot; data-start=&quot;139&quot; data-ke-size=&quot;size14&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-end=&quot;256&quot; data-start=&quot;139&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #1a5490;&quot;&gt; &lt;u&gt;Approach to the limited generalizability:&lt;/u&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-end=&quot;256&quot; data-start=&quot;139&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;Most recently, Blattmann et al. proposed extending a pretrained image diffusion model to the video domain, improving training efficiency by adding a temporal dimension and fine-tuning the temporal layers. &lt;/span&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;Inspired by this line of work, the paper uses a pre-trained model as a generative prior and proposes a local-global temporal strategy.&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-end=&quot;256&quot; data-start=&quot;139&quot; data-ke-size=&quot;size14&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-end=&quot;256&quot; data-start=&quot;139&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;color: #000000;&quot;&gt;&lt;b&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;Diffusion&amp;nbsp;Models&amp;nbsp;for&amp;nbsp;Restoration&amp;nbsp;Tasks.&lt;/span&gt;&lt;/b&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-end=&quot;256&quot; data-start=&quot;139&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;The most straightforward approach to diffusion-based image restoration is to train a diffusion model from scratch, conditioned on the LR image. However, this approach&lt;span style=&quot;color: #1a5490;&quot;&gt; &lt;u&gt;requires substantial computational resources&lt;/u&gt;&lt;/span&gt;. &lt;br /&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-end=&quot;256&quot; data-start=&quot;139&quot; data-ke-size=&quot;size14&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-end=&quot;256&quot; data-start=&quot;139&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;&lt;span style=&quot;color: #1a5490;&quot;&gt;&lt;u&gt;Approach to reducing training cost:&lt;/u&gt; &lt;/span&gt;add constraints to the reverse diffusion process of a pre-trained diffusion model. &lt;/span&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;This is efficient, but because the constraints depend on a pre-defined degradation process or an existing SR model, it generalizes poorly and the quality of the results is limited.&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-end=&quot;256&quot; data-start=&quot;139&quot; data-ke-size=&quot;size14&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-end=&quot;256&quot; data-start=&quot;139&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;color: #000000;&quot;&gt;&lt;u&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;More recent work:&lt;/span&gt;&lt;/u&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-end=&quot;256&quot; data-start=&quot;139&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;These methods add only a small number of trainable layers to a frozen pretrained diffusion model and fine-tune them. Inspired by this work, the paper focuses on exploiting an effective diffusion prior for real-world VSR.&lt;/span&gt;&lt;/p&gt;
&lt;hr contenteditable=&quot;false&quot; data-ke-type=&quot;horizontalRule&quot; data-ke-style=&quot;style6&quot; /&gt;
&lt;h2 style=&quot;text-align: justify;&quot; data-ke-size=&quot;size26&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;3. Methodology &lt;/span&gt;&lt;/h2&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;color: #000000;&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;To recap, the goal is to develop a &lt;/span&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;text-guided diffusion framework suited to real-world VSR.&lt;/span&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;&amp;nbsp;The denoising process of a diffusion model&amp;nbsp;&lt;/span&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;is inherently stochastic, &lt;/span&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;so applying it to video tasks &lt;/span&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;leads to difficulties such as temporal instability and &lt;/span&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;flickering artifacts. &lt;/span&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;These become more pronounced as the sequence gets longer.&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;&lt;/span&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignRight&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;866&quot; data-origin-height=&quot;366&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/bNfcQ6/btsOLiqYTY6/QmHhjLOEeSmKiMntB98DiK/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/bNfcQ6/btsOLiqYTY6/QmHhjLOEeSmKiMntB98DiK/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/bNfcQ6/btsOLiqYTY6/QmHhjLOEeSmKiMntB98DiK/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FbNfcQ6%2FbtsOLiqYTY6%2FQmHhjLOEeSmKiMntB98DiK%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;866&quot; height=&quot;366&quot; data-origin-width=&quot;866&quot; data-origin-height=&quot;366&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;color: #000000;&quot;&gt;&lt;u&gt;&lt;b&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;Key features of Upscale-A-Video&lt;/span&gt;&lt;/b&gt;&lt;/u&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;✔️At each diffusion time step t = 1, 2, &amp;hellip;, T, the video is split into segments, and each segment is processed by a U-Net with temporal layers, which &lt;u&gt;ensures local consistency&lt;/u&gt; within that segment.&lt;/span&gt;&lt;br /&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;✔️If the current time step belongs to the user-specified global refinement steps T&amp;lowast;, a recurrent latent propagation module is applied: it repeatedly propagates and fuses the latent z from frame to frame across segments, which &lt;u&gt;improves global consistency&lt;/u&gt;. Because the module is activated only at the steps where it is needed, it keeps the process efficient.&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;✔️A finetuned VAE-Decoder &lt;u&gt;reduces the remaining flickering artifacts&lt;/u&gt;.&lt;/span&gt;&lt;br /&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;✔️Text prompts can guide the model toward &lt;u&gt;more realistic and more detailed results&lt;/u&gt;.&lt;/span&gt;&lt;br /&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;✔️A user-specified noise level controls the &lt;u&gt;trade-off between restoration quality and fidelity&lt;/u&gt;.&lt;/span&gt;&lt;/p&gt;
&lt;h3 style=&quot;text-align: justify;&quot; data-ke-size=&quot;size23&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;3.1.&amp;nbsp;Preliminary:&amp;nbsp;Diffusion&amp;nbsp;Models&lt;/span&gt;&lt;/h3&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;color: #000000;&quot;&gt;&lt;b&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;Pretrained&amp;nbsp;Stable&amp;nbsp;Diffusion&amp;nbsp;Image&amp;nbsp;&amp;times;4&amp;nbsp;Upscaler.&lt;/span&gt;&lt;/b&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;Upscale-A-Video is built on the pretrained text-guided Stable Diffusion &amp;times;4 upscaler (SD &amp;times;4 Upscaler), which follows the Latent Diffusion Model (LDM) framework: an autoencoder maps images into a latent space.&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000; text-align: start;&quot;&gt;⬇️ &lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000; text-align: start;&quot;&gt;LDM&lt;/span&gt; paper on arXiv ⬇️&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;&lt;a style=&quot;color: #000000;&quot; href=&quot;https://arxiv.org/abs/2112.10752&quot; target=&quot;_blank&quot; rel=&quot;noopener&amp;nbsp;noreferrer&quot;&gt;https://arxiv.org/abs/2112.10752&lt;/a&gt;&lt;br /&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;Encoder E: downsamples the input image by &amp;times;4 into a latent z&lt;br /&gt;Decoder D: reconstructs the high-resolution image from z&lt;br /&gt;&lt;br /&gt;Conditioned on the low-resolution image x, the model learns to generate a high-quality image through iterative denoising in latent space.&lt;br /&gt;For latent samples drawn from real data, Gaussian noise is added at each diffusion step t to produce the noisy latent z_t:&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;171&quot; data-origin-height=&quot;23&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/sKKrw/btsOKQhJ9Ct/kOylNujCD0UpxqQDVriZO0/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/sKKrw/btsOKQhJ9Ct/kOylNujCD0UpxqQDVriZO0/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/sKKrw/btsOKQhJ9Ct/kOylNujCD0UpxqQDVriZO0/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FsKKrw%2FbtsOKQhJ9Ct%2FkOylNujCD0UpxqQDVriZO0%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;112&quot; height=&quot;23&quot; data-origin-width=&quot;171&quot; data-origin-height=&quot;23&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;&amp;nbsp;&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;To improve the model's generative ability, noise is also injected into the input image:&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;175&quot; data-origin-height=&quot;22&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/8Vn5p/btsOLp4Vwf7/Bp3I3Rd7WphwviWMLcksXk/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/8Vn5p/btsOLp4Vwf7/Bp3I3Rd7WphwviWMLcksXk/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/8Vn5p/btsOLp4Vwf7/Bp3I3Rd7WphwviWMLcksXk/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2F8Vn5p%2FbtsOLp4Vwf7%2FBp3I3Rd7WphwviWMLcksXk%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;119&quot; height=&quot;22&quot; data-origin-width=&quot;175&quot; data-origin-height=&quot;22&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;The model adopts the v-prediction parameterization, and the U-Net denoising network f_θ is trained to predict the following:&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;142&quot; data-origin-height=&quot;18&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/bzrtNs/btsOL64S2ve/lJKLAcdlNZZNKgvIj2wqH1/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/bzrtNs/btsOL64S2ve/lJKLAcdlNZZNKgvIj2wqH1/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/bzrtNs/btsOL64S2ve/lJKLAcdlNZZNKgvIj2wqH1/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FbzrtNs%2FbtsOL64S2ve%2FlJKLAcdlNZZNKgvIj2wqH1%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;110&quot; height=&quot;18&quot; data-origin-width=&quot;142&quot; data-origin-height=&quot;18&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;v-prediction parameterization: one way of defining the target that the denoiser U-Net learns, i.e., what the U-Net is trained to predict.&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000; text-align: start;&quot;&gt;The U-Net's prediction target can be one of three things: the noise, the original data, or the velocity (direction).&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;&lt;br /&gt;The training objective of the LDM is:&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;314&quot; data-origin-height=&quot;36&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/cU0JPU/btsOMk2S9sq/w8te1KkzXkBaaqEGR69HTK/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/cU0JPU/btsOMk2S9sq/w8te1KkzXkBaaqEGR69HTK/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/cU0JPU/btsOMk2S9sq/w8te1KkzXkBaaqEGR69HTK/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FcU0JPU%2FbtsOMk2S9sq%2Fw8te1KkzXkBaaqEGR69HTK%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;244&quot; height=&quot;36&quot; data-origin-width=&quot;314&quot; data-origin-height=&quot;36&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;Here c can include conditions such as the text prompt or the noise level of the input image.&lt;br /&gt;At inference time, the model can use various text prompts and noise levels, and the final sampled z_0 is decoded into a &amp;times;4 upscaled image.&lt;/span&gt;&lt;/p&gt;
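&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;As a rough illustration of the noising step and the v-prediction target described above, here is a minimal NumPy sketch. It assumes the standard forms z_t = alpha_t * z + sigma_t * eps and v_t = alpha_t * eps - sigma_t * z with alpha_t^2 + sigma_t^2 = 1; the cosine schedule, the toy latent size, and all function names are illustrative assumptions and are not taken from the paper or its code.&lt;/span&gt;&lt;/p&gt;

```python
import numpy as np

def cosine_schedule(t, T):
    """Return (alpha_t, sigma_t) with alpha_t**2 + sigma_t**2 == 1 (assumed schedule)."""
    angle = 0.5 * np.pi * t / T
    return np.cos(angle), np.sin(angle)

def add_noise(z, eps, alpha_t, sigma_t):
    """Forward diffusion step: z_t = alpha_t * z + sigma_t * eps."""
    return alpha_t * z + sigma_t * eps

def v_target(z, eps, alpha_t, sigma_t):
    """v-prediction target (assumed standard form): v_t = alpha_t * eps - sigma_t * z."""
    return alpha_t * eps - sigma_t * z

def v_loss(v_pred, v_true):
    """Mean squared error between the predicted and the true velocity."""
    return float(np.mean((v_pred - v_true) ** 2))

rng = np.random.default_rng(0)
z = rng.standard_normal((4, 8, 8))   # clean latent (C, H, W), toy size
eps = rng.standard_normal(z.shape)   # Gaussian noise
alpha_t, sigma_t = cosine_schedule(t=300, T=1000)

z_t = add_noise(z, eps, alpha_t, sigma_t)
v = v_target(z, eps, alpha_t, sigma_t)

# With a perfect v prediction, both the clean latent and the noise can be recovered.
z_rec = alpha_t * z_t - sigma_t * v
eps_rec = sigma_t * z_t + alpha_t * v
```

&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;The last two lines show why the velocity is a convenient target: a single prediction determines both the clean latent and the noise.&lt;/span&gt;&lt;/p&gt;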
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;color: #000000;&quot;&gt;&lt;b&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt; Inflated 2D Convolution.&lt;/span&gt;&lt;/b&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-end=&quot;341&quot; data-start=&quot;73&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;color: #000000;&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;✔️When a pretrained 2D diffusion model is adapted to video, the usual way to integrate temporal layers is to inflate the existing 2D convolutions into 3D convolutions.&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-end=&quot;341&quot; data-start=&quot;73&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;color: #000000;&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;✔️The pretrained Stable Diffusion &amp;times;4 upscaler is used as the base model.&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-end=&quot;341&quot; data-start=&quot;73&quot; data-ke-size=&quot;size14&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-end=&quot;341&quot; data-start=&quot;73&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;&lt;u&gt;Two steps&lt;/u&gt;&amp;nbsp;&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-end=&quot;341&quot; data-start=&quot;73&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;color: #000000;&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;1) Inflate the existing network from 2D convolutions to 3D convolutions.&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-end=&quot;341&quot; data-start=&quot;73&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;2) Initialize the model from this upscaler and transfer it to the video domain through training.&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-end=&quot;820&quot; data-start=&quot;693&quot; data-ke-size=&quot;size14&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-end=&quot;820&quot; data-start=&quot;693&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000; text-align: justify;&quot;&gt;Inflate: a 2D kernel over (H, W) is extended to a 3D kernel over (H, W, T).&lt;/span&gt;&lt;/p&gt;
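&lt;p style=&quot;text-align: justify;&quot; data-end=&quot;820&quot; data-start=&quot;693&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;The inflation step can be sketched in NumPy as follows. This is only an illustration under one assumed initialization (the 2D kernel is copied into the center temporal slice and the other slices are zero), which makes the inflated 3D convolution reproduce the original 2D convolution frame by frame. The paper's actual initialization is not stated here, and the helper names are hypothetical.&lt;/span&gt;&lt;/p&gt;

```python
import numpy as np

def inflate_conv2d_weight(w2d, kt=3):
    """Inflate a 2D kernel (out, in, kh, kw) to 3D (out, in, kt, kh, kw).

    Center initialization (an assumption): the 2D kernel fills the middle
    temporal slice and every other temporal slice is zero.
    """
    out_c, in_c, kh, kw = w2d.shape
    w3d = np.zeros((out_c, in_c, kt, kh, kw), dtype=w2d.dtype)
    w3d[:, :, kt // 2] = w2d
    return w3d

def conv2d_valid(x, w):
    """Naive valid 2D cross-correlation. x: (in, H, W), w: (out, in, kh, kw)."""
    out_c, in_c, kh, kw = w.shape
    H, W = x.shape[1] - kh + 1, x.shape[2] - kw + 1
    y = np.zeros((out_c, H, W))
    for i in range(H):
        for j in range(W):
            patch = x[:, i:i + kh, j:j + kw]
            y[:, i, j] = np.tensordot(w, patch, axes=([1, 2, 3], [0, 1, 2]))
    return y

def conv3d_time_padded(x, w):
    """Naive 3D cross-correlation, zero-padded in time and valid in space.

    x: (in, T, H, W), w: (out, in, kt, kh, kw) with odd kt.
    """
    kt = w.shape[2]
    pad = kt // 2
    xp = np.pad(x, ((0, 0), (pad, pad), (0, 0), (0, 0)))
    frames = []
    for t in range(x.shape[1]):
        acc = 0
        for dt in range(kt):
            acc = acc + conv2d_valid(xp[:, t + dt], w[:, :, dt])
        frames.append(acc)
    return np.stack(frames, axis=1)  # (out, T, H', W')

rng = np.random.default_rng(0)
w2d = rng.standard_normal((2, 3, 3, 3))     # pretrained 2D kernel (toy values)
video = rng.standard_normal((3, 5, 6, 6))   # (C, T, H, W)

w3d = inflate_conv2d_weight(w2d)
per_frame = np.stack([conv2d_valid(video[:, t], w2d) for t in range(5)], axis=1)
inflated = conv3d_time_padded(video, w3d)
```

&lt;p style=&quot;text-align: justify;&quot; data-end=&quot;820&quot; data-start=&quot;693&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;With this initialization, inflated matches per_frame, so the video model starts from the image model's behavior and only learns temporal mixing during finetuning.&lt;/span&gt;&lt;/p&gt;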
&lt;h3 style=&quot;text-align: justify;&quot; data-end=&quot;820&quot; data-start=&quot;693&quot; data-ke-size=&quot;size23&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt; 3.2. Local Consistency within Video Segments&lt;/span&gt;&lt;/h3&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;To adapt a pretrained text-to-image SD model to video tasks, existing video diffusion models (VDMs) have relied on 3D convolutions, temporal attention, and cross-frame attention.&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;b&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;Finetuning Temporal U-Net.&lt;/span&gt;&lt;/b&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;Following prior work, temporal layers are added to the pretrained image model to learn local consistency constraints within a video segment. In the temporal U-Net, 3D residual blocks built on 3D convolutions and temporal attention serve as the temporal layers, and they are inserted into the pretrained spatial layers.&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000; text-align: start;&quot;&gt;✔️Temporal attention performs self-attention along the temporal dimension and attends to all frames in the local segment.&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000; text-align: start;&quot;&gt;✔️The temporal layers &lt;u&gt;use RoPE (Rotary Position Embedding)&lt;/u&gt; to provide a position embedding that encodes temporal position.&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;color: #000000;&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; text-align: start;&quot;&gt;⬇️ RoPE paper on arXiv ⬇️&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000; text-align: start;&quot;&gt;&lt;a style=&quot;color: #000000;&quot; href=&quot;https://arxiv.org/abs/2104.09864&quot; target=&quot;_blank&quot; rel=&quot;noopener&amp;nbsp;noreferrer&quot;&gt;https://arxiv.org/abs/2104.09864&lt;/a&gt;&lt;/span&gt;&lt;/p&gt;
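&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;To make the combination of temporal attention and RoPE more concrete, here is a small NumPy sketch. It is an assumption-laden toy: the query, key, and value projections are replaced by the identity, a single head is used, and RoPE is applied in the half-split form. None of the function names come from the Upscale-A-Video code.&lt;/span&gt;&lt;/p&gt;

```python
import numpy as np

def rope(x, positions, base=10000.0):
    """Apply rotary position embedding to x of shape (..., T, D) with even D."""
    half = x.shape[-1] // 2
    freqs = base ** (-np.arange(half) / half)       # (half,)
    angles = positions[:, None] * freqs[None, :]    # (T, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[..., :half], x[..., half:]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)

def softmax(a, axis=-1):
    """Numerically stable softmax."""
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def temporal_self_attention(feat):
    """feat: (T, C, H, W). Each spatial location attends over its own T frames."""
    T, C, H, W = feat.shape
    tokens = feat.transpose(2, 3, 0, 1).reshape(H * W, T, C)   # (HW, T, C)
    pos = np.arange(T, dtype=np.float64)
    q = rope(tokens, pos)   # identity projections for brevity (assumption)
    k = rope(tokens, pos)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(C)             # (HW, T, T)
    out = softmax(scores) @ tokens                             # (HW, T, C)
    return out.reshape(H, W, T, C).transpose(2, 3, 0, 1)       # (T, C, H, W)

rng = np.random.default_rng(0)
feat = rng.standard_normal((8, 4, 2, 2))   # one local segment of 8 frames
out = temporal_self_attention(feat)
```

&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;Because RoPE rotates queries and keys by an angle proportional to the frame index, the attention score between two frames depends only on their temporal offset, not on their absolute positions.&lt;/span&gt;&lt;/p&gt;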
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-end=&quot;716&quot; data-start=&quot;582&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;color: #000000;&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;These temporal layers are trained with the same noise schedule as the original image model. The pretrained spatial layers are frozen, and only the inserted temporal layers are optimized with the LDM training objective.&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-end=&quot;738&quot; data-start=&quot;718&quot; data-ke-size=&quot;size14&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-end=&quot;738&quot; data-start=&quot;718&quot; data-ke-size=&quot;size14&quot;&gt;&lt;u&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;Benefits of this approach: &lt;/span&gt;&lt;/u&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-end=&quot;738&quot; data-start=&quot;718&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;1) It &lt;u&gt;reuses the spatial knowledge&lt;/u&gt; learned from large-scale, high-quality image datasets as-is.&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-end=&quot;738&quot; data-start=&quot;718&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;2) Fewer layers are trained, so &lt;u&gt;training resources are used more efficiently.&lt;/u&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;b&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt; Finetuning Temporal VAE-Decoder. &lt;/span&gt;&lt;/b&gt;&lt;/p&gt;
&lt;p style=&quot;background-color: #ffffff; color: #1f1f1f; text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;Even after the U-Net is finetuned on video data, the VAE-Decoder in the LDM framework, which was trained only on images, &lt;span style=&quot;color: #1a5490;&quot;&gt;&lt;u&gt;still produces flickering artifacts&lt;/u&gt;&lt;/span&gt; when decoding a latent sequence. In addition, &lt;span style=&quot;font-family: 'Noto Serif KR'; color: #1f1f1f;&quot;&gt;&lt;u&gt;&lt;span style=&quot;color: #953b34;&quot;&gt;the diffusion denoising process of the U-Net often causes color shifts&lt;/span&gt;&lt;/u&gt;, &lt;/span&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #1f1f1f;&quot;&gt;a problem shared by other diffusion-based restoration networks.&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;background-color: #ffffff; color: #1f1f1f; text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p style=&quot;background-color: #ffffff; color: #1f1f1f; text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;u&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; background-color: #ffffff; color: #1a5490; text-align: start;&quot;&gt;How the flickering artifacts are addressed:&lt;/span&gt;&lt;/u&gt;&lt;/p&gt;
&lt;p style=&quot;background-color: #ffffff; color: #1f1f1f; text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;Temporal 3D residual blocks are added to the VAE-Decoder to strengthen low-level consistency.&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;background-color: #ffffff; color: #1f1f1f; text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;color: #953b34;&quot;&gt;&lt;u&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; background-color: #ffffff; text-align: start;&quot;&gt;How the color shift is addressed:&lt;/span&gt;&lt;/u&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;background-color: #ffffff; color: #1f1f1f; text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;color: #000000;&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;The input video is passed through an SFT (Spatial Feature Transform) layer so that the input information modulates the features of the first layer of the VAE-Decoder.&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;background-color: #ffffff; color: #1f1f1f; text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p style=&quot;background-color: #ffffff; color: #1f1f1f; text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;color: #000000;&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; text-align: start;&quot;&gt;⬇️&amp;nbsp;&lt;span style=&quot;font-family: 'Noto Serif KR'; background-color: #ffffff; text-align: start;&quot;&gt;SFT&amp;nbsp;&lt;/span&gt;&lt;/span&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; text-align: start;&quot;&gt;paper on arXiv ⬇️&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;background-color: #ffffff; color: #1f1f1f; text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;color: #000000;&quot;&gt;&lt;a style=&quot;color: #000000;&quot; href=&quot;https://arxiv.org/abs/1804.02815&quot;&gt;https://arxiv.org/abs/1804.02815&lt;/a&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;background-color: #ffffff; color: #1f1f1f; text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p style=&quot;background-color: #ffffff; color: #1f1f1f; text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;color: #000000;&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; background-color: #ffffff; text-align: justify;&quot;&gt;Color shift:&lt;/span&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; background-color: #ffffff; text-align: justify;&quot;&gt;&amp;nbsp;&lt;/span&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; background-color: #ffffff; text-align: justify;&quot;&gt;the output fails to preserve the color or brightness tone of the original image and alters it.&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;background-color: #ffffff; color: #1f1f1f; text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;color: #000000;&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; background-color: #ffffff; text-align: justify;&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; background-color: #ffffff; text-align: start;&quot;&gt;SFT (Spatial Feature Transform):&lt;/span&gt;&lt;/span&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;&amp;nbsp;low-frequency information such as the overall color tone and brightness distribution is extracted from the original low-resolution input and used as a condition.&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
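&lt;p style=&quot;background-color: #ffffff; color: #1f1f1f; text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;The idea behind SFT can be sketched as a per-pixel affine modulation: a scale gamma and a shift beta are predicted from the condition (here, the low-resolution frame) and applied to the decoder features. The NumPy snippet below is a simplified assumption: it uses single 1x1 linear maps with hand-set toy weights where the original SFT layer uses small learned convolutional networks, and all names are hypothetical.&lt;/span&gt;&lt;/p&gt;

```python
import numpy as np

def conv1x1(x, w, b):
    """1x1 convolution. x: (C_in, H, W), w: (C_out, C_in), b: (C_out,)."""
    return np.tensordot(w, x, axes=([1], [0])) + b[:, None, None]

def sft(feat, cond, w_gamma, b_gamma, w_beta, b_beta):
    """Spatial Feature Transform: feat * gamma + beta.

    gamma and beta are predicted per pixel from the condition map.
    """
    gamma = conv1x1(cond, w_gamma, b_gamma)
    beta = conv1x1(cond, w_beta, b_beta)
    return feat * gamma + beta

rng = np.random.default_rng(0)
feat = rng.standard_normal((4, 8, 8))   # first-layer decoder features (toy size)
cond = rng.standard_normal((3, 8, 8))   # low-resolution RGB frame, same H and W

zeros_w = np.zeros((4, 3))

# gamma == 1 and beta == 0 everywhere: the features pass through unchanged.
out_identity = sft(feat, cond, zeros_w, np.ones(4), zeros_w, np.zeros(4))

# gamma == 1 and beta == 0.5 everywhere: a constant shift of every feature.
out_shifted = sft(feat, cond, zeros_w, np.ones(4), zeros_w, np.full(4, 0.5))
```

&lt;p style=&quot;background-color: #ffffff; color: #1f1f1f; text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;In a trained layer the weights are not zero, so gamma and beta vary with the input frame, which lets the decoder pull its output back toward the input's color and brightness.&lt;/span&gt;&lt;/p&gt;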
&lt;p style=&quot;background-color: #ffffff; color: #1f1f1f; text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p style=&quot;background-color: #ffffff; color: #1f1f1f; text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;color: #000000;&quot;&gt;&lt;u&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;Hybrid loss function used to train the temporal layers:&lt;/span&gt;&lt;/u&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;background-color: #ffffff; color: #1f1f1f; text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;1) L1 loss&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;background-color: #ffffff; color: #1f1f1f; text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;2) LPIPS perceptual loss&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;background-color: #ffffff; color: #1f1f1f; text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;3) Adversarial loss with a temporal PatchGAN discriminator&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;background-color: #ffffff; color: #1f1f1f; text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p style=&quot;background-color: #ffffff; color: #1f1f1f; text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;color: #000000;&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; text-align: start;&quot;&gt;⬇️ &lt;span style=&quot;font-family: 'Noto Serif KR'; background-color: #ffffff; text-align: start;&quot;&gt;LPIPS perceptual loss&lt;/span&gt; &lt;/span&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; text-align: start;&quot;&gt;paper on arXiv ⬇️&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;background-color: #ffffff; color: #1f1f1f; text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;color: #000000;&quot;&gt;&lt;a style=&quot;color: #000000;&quot; href=&quot;https://arxiv.org/abs/1801.03924&quot;&gt;https://arxiv.org/abs/1801.03924&lt;/a&gt; &lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;background-color: #ffffff; color: #1f1f1f; text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;color: #000000;&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; text-align: start;&quot;&gt;⬇️ &lt;span style=&quot;font-family: 'Noto Serif KR'; background-color: #ffffff; text-align: start;&quot;&gt;Temporal PatchGAN&lt;/span&gt; &lt;/span&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; text-align: start;&quot;&gt;paper on arXiv ⬇️&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;background-color: #ffffff; color: #1f1f1f; text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;&lt;a style=&quot;color: #000000;&quot; href=&quot;https://arxiv.org/abs/2309.03897&quot; target=&quot;_blank&quot; rel=&quot;noopener&amp;nbsp;noreferrer&quot;&gt;https://arxiv.org/abs/2309.03897&lt;/a&gt;&lt;/span&gt;&lt;/p&gt;
&lt;h3 style=&quot;text-align: justify;&quot; data-ke-size=&quot;size23&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt; 3.3. Global Consistency cross Video Segments&lt;/span&gt;&lt;/h3&gt;
&lt;p style=&quot;text-align: justify;&quot; data-end=&quot;258&quot; data-start=&quot;104&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;color: #000000;&quot;&gt;&lt;u&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;Limitation of the existing approach:&lt;/span&gt;&lt;/u&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt; the temporal layers trained within the LDM can only process a local sequence (8 frames in the U-Net setting), so they are limited in preserving global consistency.&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-end=&quot;258&quot; data-start=&quot;104&quot; data-ke-size=&quot;size14&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-end=&quot;258&quot; data-start=&quot;104&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;color: #000000;&quot;&gt;&lt;u&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; text-align: justify;&quot;&gt;Preserving global consistency:&lt;/span&gt;&lt;/u&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt; prior work has shown that flow-guided long-term propagation helps temporal consistency, but it is &lt;/span&gt;&lt;u&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;not well suited to diffusion models&lt;/span&gt;&lt;/u&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;, which can only process short video clips.&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-is-only-node=&quot;&quot; data-is-last-node=&quot;&quot; data-end=&quot;631&quot; data-start=&quot;492&quot; data-ke-size=&quot;size14&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-is-only-node=&quot;&quot; data-is-last-node=&quot;&quot; data-end=&quot;631&quot; data-start=&quot;492&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;color: #000000;&quot;&gt;&lt;b&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;Training-Free Recurrent Latent Propagation.&lt;/span&gt;&lt;/b&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-end=&quot;439&quot; data-start=&quot;235&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;color: #000000;&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;The paper proposes a training-free, &lt;/span&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;flow-guided recurrent propagation module that operates in latent space. &lt;/span&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;The module enforces global temporal coherence for long video inputs. &lt;/span&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;It propagates information between frames in both directions, forward and backward.&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-end=&quot;439&quot; data-start=&quot;235&quot; data-ke-size=&quot;size14&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-end=&quot;439&quot; data-start=&quot;235&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;color: #000000;&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;1. &lt;u&gt;Optical flow estimation:&lt;/u&gt; The optical flow is estimated with the RAFT model (no resizing is needed).&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot; data-start=&quot;999&quot; data-end=&quot;1116&quot; data-is-last-node=&quot;&quot; data-is-only-node=&quot;&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-end=&quot;439&quot; data-start=&quot;235&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;2. To check whether the estimated optical flow is valid, the forward-backward consistency error is computed.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;824&quot; data-origin-height=&quot;85&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/dsg90P/btsOKTyQ8Th/QKuauCJoHo2Tnh4N3IZsEk/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/dsg90P/btsOKTyQ8Th/QKuauCJoHo2Tnh4N3IZsEk/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/dsg90P/btsOKTyQ8Th/QKuauCJoHo2Tnh4N3IZsEk/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2Fdsg90P%2FbtsOKTyQ8Th%2FQKuauCJoHo2Tnh4N3IZsEk%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;368&quot; height=&quot;85&quot; data-origin-width=&quot;824&quot; data-origin-height=&quot;85&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-end=&quot;439&quot; data-start=&quot;235&quot; data-ke-size=&quot;size14&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-end=&quot;439&quot; data-start=&quot;235&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;3. The previous frame's latent is warped to the current frame's position using the optical flow, and latent fusion is then performed.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;721&quot; data-origin-height=&quot;140&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/dt0i93/btsOKSUgTiH/IjWXcDlJgKF9gyKam2kagK/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/dt0i93/btsOKSUgTiH/IjWXcDlJgKF9gyKam2kagK/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/dt0i93/btsOKSUgTiH/IjWXcDlJgKF9gyKam2kagK/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2Fdt0i93%2FbtsOKSUgTiH%2FIjWXcDlJgKF9gyKam2kagK%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;335&quot; height=&quot;140&quot; data-origin-width=&quot;721&quot; data-origin-height=&quot;140&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-end=&quot;439&quot; data-start=&quot;235&quot; data-ke-size=&quot;size14&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-end=&quot;439&quot; data-start=&quot;235&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;4. The propagation does not have to be applied at every diffusion step; it is applied only at the selected steps T*.&lt;/span&gt;&lt;/p&gt;
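&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;Steps 2 and 3 can be sketched roughly as follows. This is a minimal illustration and not the paper's implementation: the exact formulas are in the figures above, while the names sample and propagate_latent, the threshold delta, the blending weight beta, nearest-neighbour sampling, and the single-channel H x W latent are all assumptions made for the example.&lt;/span&gt;&lt;/p&gt;

```python
import math


def sample(grid, y, x):
    """Nearest-neighbour lookup with border clamping (hypothetical helper)."""
    h, w = len(grid), len(grid[0])
    yy = min(max(round(y), 0), h - 1)
    xx = min(max(round(x), 0), w - 1)
    return grid[yy][xx]


def propagate_latent(z_cur, z_prev, flow_fw, flow_bw, delta=0.5, beta=0.5):
    """Fuse the previous latent into the current one where the flow looks reliable.

    z_cur, z_prev: H x W grids of floats (one latent channel, for brevity).
    flow_fw: per-pixel (dx, dy) from the current frame to the previous frame.
    flow_bw: per-pixel (dx, dy) from the previous frame to the current frame.
    delta, beta: assumed threshold and blending weight, not values from the paper.
    """
    out = [row[:] for row in z_cur]
    for y, row in enumerate(flow_fw):
        for x, (dx, dy) in enumerate(row):
            # Look up the backward flow at the position the forward flow points to.
            bx, by = sample(flow_bw, y + dy, x + dx)
            # Forward-backward consistency error: consistent flows cancel out.
            err = math.hypot(dx + bx, dy + by)
            if delta > err:
                # Warp the previous latent to this position and blend it in.
                warped = sample(z_prev, y + dy, x + dx)
                out[y][x] = beta * warped + (1.0 - beta) * z_cur[y][x]
    return out
```

&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;Positions whose consistency error reaches the threshold keep the current latent unchanged, so unreliable flow (for example around occlusions) does not leak into the result.&lt;/span&gt;&lt;/p&gt;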
&lt;p style=&quot;text-align: justify;&quot; data-end=&quot;439&quot; data-start=&quot;235&quot; data-ke-size=&quot;size14&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-end=&quot;439&quot; data-start=&quot;235&quot; data-ke-size=&quot;size14&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;⬇️ RAFT paper arXiv link ⬇️&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-is-only-node=&quot;&quot; data-is-last-node=&quot;&quot; data-end=&quot;1116&quot; data-start=&quot;999&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;&lt;a style=&quot;color: #000000;&quot; href=&quot;https://arxiv.org/pdf/2003.12039&quot;&gt;https://arxiv.org/pdf/2003.12039&lt;/a&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000; text-align: justify;&quot;&gt; ⬇️&amp;nbsp; &lt;span style=&quot;text-align: start;&quot;&gt;forward-backward consistency error&lt;/span&gt; paper link ⬇️ &lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;color: #000000; font-family: 'Noto Serif KR';&quot;&gt;&lt;a style=&quot;color: #000000;&quot; href=&quot;https://ojs.aaai.org/index.php/AAAI/article/view/12276&quot; target=&quot;_blank&quot; rel=&quot;noopener&amp;nbsp;noreferrer&quot;&gt;https://ojs.aaai.org/index.php/AAAI/article/view/12276&lt;/a&gt;&lt;/span&gt;&lt;/p&gt;
&lt;h3 style=&quot;text-align: justify;&quot; data-ke-size=&quot;size23&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;3.4. Inference with Additional Conditions&lt;/span&gt;&lt;/h3&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;&lt;u&gt;Adjusting the additional conditions&lt;/u&gt; influences the diffusion denoising process.&lt;/span&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; text-align: start;&quot;&gt;&lt;b&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; text-align: start;&quot;&gt;✔️&lt;/span&gt;&lt;/b&gt; Effect of adjusting text prompts and &lt;span style=&quot;font-family: 'Noto Serif KR'; text-align: start;&quot;&gt;noise levels: &lt;/span&gt;texture details can be generated.&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;color: #000000;&quot;&gt; &lt;span style=&quot;font-family: 'Noto Serif KR'; text-align: start;&quot;&gt;✔️ &lt;span style=&quot;font-family: 'Noto Serif KR'; text-align: justify;&quot;&gt;Adding Classifier-Free Guidance (&lt;/span&gt;&lt;/span&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; text-align: start;&quot;&gt;CFG) helps amplify the effect above.&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000; text-align: justify;&quot;&gt; ⬇️ &lt;span style=&quot;font-family: 'Noto Serif KR'; text-align: justify;&quot;&gt;CFG&lt;/span&gt; paper arXiv link ⬇️ &lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;&lt;a style=&quot;color: #000000;&quot; href=&quot;https://arxiv.org/abs/2207.12598&quot; target=&quot;_blank&quot; rel=&quot;noopener&amp;nbsp;noreferrer&quot;&gt;https://arxiv.org/abs/2207.12598&lt;/a&gt;&lt;/span&gt;&lt;/p&gt;
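&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;To illustrate how CFG amplifies the conditioning, the commonly used combination rule extrapolates from the unconditional (null-prompt) noise prediction toward the text-conditioned one. The sketch below is only that general rule on plain lists; the function name cfg_combine and the example scale are assumptions, not settings reported in the paper.&lt;/span&gt;&lt;/p&gt;

```python
def cfg_combine(eps_uncond, eps_cond, scale):
    """Classifier-free guidance on two noise predictions, element-wise.

    eps = eps_uncond + scale * (eps_cond - eps_uncond)
    scale = 0 gives the unconditional prediction, scale = 1 the conditional one,
    and scale above 1 pushes further in the direction of the condition.
    """
    return [u + scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


# Hypothetical predictions for two latent values.
eps_null_prompt = [0.0, 1.0]
eps_text_prompt = [1.0, 3.0]
print(cfg_combine(eps_null_prompt, eps_text_prompt, 2.0))
```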
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;1051&quot; data-origin-height=&quot;773&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/cdQgtU/btsOLf9xiuK/0ATEATLCmqIDzCLYL65bq1/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/cdQgtU/btsOLf9xiuK/0ATEATLCmqIDzCLYL65bq1/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/cdQgtU/btsOLf9xiuK/0ATEATLCmqIDzCLYL65bq1/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FcdQgtU%2FbtsOLf9xiuK%2F0ATEATLCmqIDzCLYL65bq1%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;684&quot; height=&quot;773&quot; data-origin-width=&quot;1051&quot; data-origin-height=&quot;773&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;1103&quot; data-origin-height=&quot;450&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/ezHKmJ/btsOL2VZXXr/KjqgIRMAJII97EuKxZNHak/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/ezHKmJ/btsOL2VZXXr/KjqgIRMAJII97EuKxZNHak/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/ezHKmJ/btsOL2VZXXr/KjqgIRMAJII97EuKxZNHak/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FezHKmJ%2FbtsOL2VZXXr%2FKjqgIRMAJII97EuKxZNHak%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;1103&quot; height=&quot;450&quot; data-origin-width=&quot;1103&quot; data-origin-height=&quot;450&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;hr contenteditable=&quot;false&quot; data-ke-type=&quot;horizontalRule&quot; data-ke-style=&quot;style6&quot; /&gt;
&lt;h2 style=&quot;text-align: justify;&quot; data-ke-size=&quot;size26&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt; 4&lt;span style=&quot;color: #000000;&quot;&gt;. Experiments &lt;/span&gt;&lt;/span&gt;&lt;/h2&gt;
&lt;h3 style=&quot;text-align: justify;&quot; data-ke-size=&quot;size23&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt; 4.1. Datasets and Implementation &lt;/span&gt;&lt;/h3&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;&lt;b&gt;Training Datasets.&lt;/b&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;background-color: #ffffff; color: #1f1f1f; text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;color: #000000;&quot;&gt;&lt;u&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;1) A subset of WebVid10M&lt;/span&gt;&lt;/u&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;background-color: #ffffff; color: #1f1f1f; text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;Consists of about 335K video-text pairs&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;background-color: #ffffff; color: #1f1f1f; text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;Resolution of about 336&amp;times;596&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;background-color: #ffffff; color: #1f1f1f; text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;A dataset widely used for training VDMs&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;background-color: #ffffff; color: #1f1f1f; text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;color: #000000;&quot;&gt;&lt;u&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;2) YouHQ dataset (collected by the authors)&lt;/span&gt;&lt;/u&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;background-color: #ffffff; color: #1f1f1f; text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;About 37K high-definition videos (1080&amp;times;1920) collected from YouTube&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;background-color: #ffffff; color: #1f1f1f; text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;Covers &lt;span style=&quot;font-family: 'Noto Serif KR'; background-color: #ffffff; text-align: start;&quot;&gt;diverse scenes&lt;/span&gt; such as streets, landscapes, animals, human faces, static objects, underwater scenes, and night scenes&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;background-color: #ffffff; color: #1f1f1f; text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;Helps improve VSR generation quality in real-world settings&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;background-color: #ffffff; color: #1f1f1f; text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p style=&quot;background-color: #ffffff; color: #1f1f1f; text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;color: #000000;&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;LQ-HQ video pairs for training are generated following the degradation pipeline of &lt;/span&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;RealBasicVSR&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;br /&gt;&lt;span style=&quot;color: #000000;&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; text-align: justify;&quot;&gt;⬇️ &lt;span style=&quot;font-family: 'Noto Serif KR'; background-color: #ffffff; text-align: start;&quot;&gt;RealBasicVSR&lt;/span&gt;&lt;/span&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; text-align: justify;&quot;&gt;&amp;nbsp;paper arXiv link ⬇️&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;&lt;a style=&quot;color: #000000;&quot; href=&quot;https://arxiv.org/abs/2111.12704&quot; target=&quot;_blank&quot; rel=&quot;noopener&amp;nbsp;noreferrer&quot;&gt;https://arxiv.org/abs/2111.12704&lt;/a&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;&lt;b&gt;Testing Datasets.&lt;/b&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;background-color: #ffffff; color: #1f1f1f; text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;color: #000000;&quot;&gt;&lt;u&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;Four synthetic test datasets are used in total&lt;/span&gt;&lt;/u&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;background-color: #ffffff; color: #1f1f1f; text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;SPMCS&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;background-color: #ffffff; color: #1f1f1f; text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;UDM10&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;background-color: #ffffff; color: #1f1f1f; text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;REDS30&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;background-color: #ffffff; color: #1f1f1f; text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;YouHQ40 (40 videos held out from YouHQ for testing)&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;background-color: #ffffff; color: #1f1f1f; text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p style=&quot;background-color: #ffffff; color: #1f1f1f; text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;color: #000000;&quot;&gt;&lt;u&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;Additional evaluation of real-world performance&lt;/span&gt;&lt;/u&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;background-color: #ffffff; color: #1f1f1f; text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;VideoLQ (real-world low-quality videos)&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;background-color: #ffffff; color: #1f1f1f; text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;AIGC30: 30 AI-generated videos collected from recent text-to-video generation models&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;&lt;b&gt;Training Details.&lt;/b&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;GPU: 32 &amp;times; NVIDIA A100 80GB&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;Batch size: 384&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;Input size: 80&amp;times;80, clip length of 8 frames&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;Optimizer: Adam, learning rate = 1&amp;times;10^(-4)&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;color: #000000;&quot;&gt;&lt;u&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;U-Net:&lt;/span&gt;&lt;/u&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;Trained on WebVid10M + YouHQ combined for 70K iterations&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;Then trained for another 10K iterations on YouHQ only&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;Because YouHQ has no text prompts, a null prompt is used&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;color: #000000;&quot;&gt;&lt;u&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;VAE-Decoder:&lt;/span&gt;&lt;/u&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;Follows the StableSR approach&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;100K synthetic LQ-HQ video pairs are generated from WebVid10M and YouHQ&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;The trained U-Net generates latent codes for the LQ videos &amp;rarr; the decoder is fine-tuned on them&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;&lt;b&gt;Evaluation Metrics.&lt;/b&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;color: #000000;&quot;&gt;&lt;u&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;Synthetic datasets (GT available)&lt;/span&gt;&lt;/u&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;PSNR&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;SSIM&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;LPIPS&amp;nbsp;&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;Flow warping error (E*warp)&amp;nbsp;&lt;/span&gt;&lt;/p&gt;
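&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;As a reminder of what the first of these full-reference metrics measures, PSNR is 10 * log10(MAX^2 / MSE) between the restored frame and the ground truth. The snippet below is a minimal sketch on flat lists of pixel values in [0, 1]; the function name psnr and the toy inputs are assumptions for illustration, and the paper's evaluation code may differ (for example in color space or cropping).&lt;/span&gt;&lt;/p&gt;

```python
import math


def psnr(reference, restored, max_val=1.0):
    """Peak signal-to-noise ratio in dB between two equally sized pixel lists."""
    mse = sum((r - o) ** 2 for r, o in zip(reference, restored)) / len(reference)
    if mse == 0:
        return math.inf  # identical inputs
    return 10.0 * math.log10(max_val ** 2 / mse)


# Toy example: every pixel is off by 0.1, so MSE = 0.01 and PSNR is about 20 dB.
print(psnr([0.0, 0.0, 0.0, 0.0], [0.1, 0.1, 0.1, 0.1]))
```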
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;color: #000000;&quot;&gt;&lt;u&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;Real-world and AIGC videos (no GT):&lt;/span&gt;&lt;/u&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt; &lt;/span&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;non-reference metrics are used&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;CLIP-IQA&amp;nbsp;&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;MUSIQ&amp;nbsp;&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;DOVER&amp;nbsp;&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;h3 style=&quot;text-align: justify;&quot; data-ke-size=&quot;size23&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;4.2. Comparisons &lt;/span&gt;&lt;/h3&gt;
&lt;p style=&quot;text-align: justify;&quot; data-end=&quot;173&quot; data-start=&quot;103&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;Performance is compared against the following recent super-resolution methods.&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;Real-ESRGAN&amp;nbsp;&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;Stable Diffusion &amp;times;4 Upscaler&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;ResShift&amp;nbsp;&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;StableSR&amp;nbsp;&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;RealVSR&amp;nbsp;&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;DBVSR&amp;nbsp;&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;RealBasicVSR&amp;nbsp;&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;1464&quot; data-origin-height=&quot;697&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/cipq9E/btsOLCpKaj2/Cpvqj6MInxIPw4OaKtH9R0/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/cipq9E/btsOLCpKaj2/Cpvqj6MInxIPw4OaKtH9R0/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/cipq9E/btsOLCpKaj2/Cpvqj6MInxIPw4OaKtH9R0/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2Fcipq9E%2FbtsOLCpKaj2%2FCpvqj6MInxIPw4OaKtH9R0%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;1464&quot; height=&quot;697&quot; data-origin-width=&quot;1464&quot; data-origin-height=&quot;697&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;color: #000000;&quot;&gt;&lt;b&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;Quantitative Evaluation.&lt;/span&gt;&lt;/b&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;Upscale-A-Video achieves the highest PSNR on all four synthetic test datasets.&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;On UDM10 and YouHQ40 it also obtains the lowest LPIPS score, indicating that the generated videos have high perceptual quality.&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;On the real-world VideoLQ videos and the AI-generated AIGC30 videos, it likewise achieves the highest CLIP-IQA and DOVER scores.&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;color: #000000;&quot;&gt;&lt;b&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;Qualitative Evaluation.&lt;/span&gt; &lt;/b&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;1527&quot; data-origin-height=&quot;472&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/dcO3W8/btsOMOo8MT2/eW4f6iP5B3oBb70x4IyYHk/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/dcO3W8/btsOMOo8MT2/eW4f6iP5B3oBb70x4IyYHk/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/dcO3W8/btsOMOo8MT2/eW4f6iP5B3oBb70x4IyYHk/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FdcO3W8%2FbtsOMOo8MT2%2FeW4f6iP5B3oBb70x4IyYHk%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;1527&quot; height=&quot;472&quot; data-origin-width=&quot;1527&quot; data-origin-height=&quot;472&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;1541&quot; data-origin-height=&quot;753&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/MEcgf/btsOKQa6Bcd/HtGrHdhBGdhezpPZ1RP9Z1/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/MEcgf/btsOKQa6Bcd/HtGrHdhBGdhezpPZ1RP9Z1/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/MEcgf/btsOKQa6Bcd/HtGrHdhBGdhezpPZ1RP9Z1/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FMEcgf%2FbtsOKQa6Bcd%2FHtGrHdhBGdhezpPZ1RP9Z1%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;1541&quot; height=&quot;753&quot; data-origin-width=&quot;1541&quot; data-origin-height=&quot;753&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;color: #000000;&quot;&gt;&lt;b&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;Temporal Consistency.&amp;nbsp;&lt;/span&gt;&lt;/b&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-end=&quot;176&quot; data-start=&quot;73&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;background-color: #ffffff; color: #000000; text-align: start; font-family: 'Noto Serif KR';&quot;&gt;The local-global temporal strategy yields strong temporal consistency.&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-end=&quot;176&quot; data-start=&quot;73&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;background-color: #ffffff; color: #000000; text-align: start; font-family: 'Noto Serif KR';&quot;&gt;On the UDM10 dataset, the method records the lowest flow warping error.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;540&quot; data-origin-height=&quot;412&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/bFuSkt/btsOLeJyfrb/U2xgbnI7KHjVw09Kv34c9k/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/bFuSkt/btsOLeJyfrb/U2xgbnI7KHjVw09Kv34c9k/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/bFuSkt/btsOLeJyfrb/U2xgbnI7KHjVw09Kv34c9k/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FbFuSkt%2FbtsOLeJyfrb%2FU2xgbnI7KHjVw09Kv34c9k%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;329&quot; height=&quot;412&quot; data-origin-width=&quot;540&quot; data-origin-height=&quot;412&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-end=&quot;457&quot; data-start=&quot;321&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;This means that Upscale-A-Video outperforms not only other diffusion-based methods but also strong CNN-based VSR models such as RealBasicVSR and DBVSR. The temporal profile visualization also shows smoother, more seamless transitions.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;1118&quot; data-origin-height=&quot;394&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/sHU3Y/btsONMqQqIo/Q0vWYmuNkCPng5wxyZ0F0k/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/sHU3Y/btsONMqQqIo/Q0vWYmuNkCPng5wxyZ0F0k/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/sHU3Y/btsONMqQqIo/Q0vWYmuNkCPng5wxyZ0F0k/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FsHU3Y%2FbtsONMqQqIo%2FQ0vWYmuNkCPng5wxyZ0F0k%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;1118&quot; height=&quot;394&quot; data-origin-width=&quot;1118&quot; data-origin-height=&quot;394&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;h3 style=&quot;text-align: justify;&quot; data-ke-size=&quot;size23&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt; 4.3. Ablation Study&lt;/span&gt;&lt;/h3&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;color: #000000;&quot;&gt;&lt;b&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt; Effectiveness of Finetuned VAE-Decoder.&lt;/span&gt;&lt;/b&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p style=&quot;background-color: #ffffff; color: #1f1f1f; text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;color: #000000;&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;The authors evaluate how much finetuning the VAE-Decoder matters. &lt;/span&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;As Table 2 shows, replacing the finetuned VAE-Decoder with the original decoder &lt;/span&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;degrades PSNR, SSIM, and especially E*warp. &lt;/span&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;E*warp rises from 0.737 to 1.815, which indicates a substantial drop in temporal consistency. &lt;/span&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;The comparison in Fig. 6 also shows visually that videos generated without the finetuned VAE-Decoder exhibit more flickering.&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;background-color: #ffffff; color: #1f1f1f; text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;color: #000000;&quot;&gt;&lt;b&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt; Effectiveness of Propagation Module.&lt;/span&gt;&lt;/b&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p style=&quot;background-color: #ffffff; color: #1f1f1f; text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;color: #000000;&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;Beyond the VAE-Decoder, the authors propose a flow-guided recurrent latent propagation module that improves stability on long videos. &lt;/span&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;As Table 2 shows, introducing this module lowers the E*warp error further and improves&lt;span style=&quot;font-family: 'Noto Serif KR'; background-color: #ffffff; text-align: justify;&quot;&gt;&amp;nbsp;temporal consistency&lt;/span&gt; even more. &lt;/span&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;Per-frame metrics such as PSNR remain unchanged. &lt;/span&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;The temporal profile in Fig. 6 likewise shows smoother, more natural transitions between frames.&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;background-color: #ffffff; color: #1f1f1f; text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;color: #000000;&quot;&gt;&lt;b&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt; Text Prompt.&lt;/span&gt;&lt;/b&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;background-color: #ffffff; color: #1f1f1f; text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;color: #000000;&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;The model is trained both with text prompts and without them (null prompt). &lt;/span&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;Classifier-Free Guidance is used to improve visual quality at sampling time. &lt;/span&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;Fig. 7 shows that a suitable &lt;span style=&quot;font-family: 'Noto Serif KR'; background-color: #ffffff; text-align: start;&quot;&gt;text prompt&lt;/span&gt; produces finer, more realistic details.&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
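&lt;p style=&quot;background-color: #ffffff; color: #1f1f1f; text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;color: #000000;&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;As a rough illustration of Classifier-Free Guidance, the sketch below blends a null-prompt noise prediction with a text-prompt one using the commonly used formulation eps = eps_null + s * (eps_text - eps_null). The function name, the toy numbers, and the guidance scale of 2.0 are assumptions made for illustration; they are not taken from the paper.&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;

```python
def classifier_free_guidance(eps_uncond, eps_cond, guidance_scale):
    """Blend unconditional and text-conditioned noise predictions element-wise."""
    # guidance_scale = 1.0 returns the conditional prediction unchanged;
    # larger values push the result further toward the text condition.
    return [u + guidance_scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


# Toy noise predictions for a four-element latent (illustrative numbers only).
eps_null_prompt = [0.10, -0.20, 0.05, 0.00]
eps_text_prompt = [0.30, -0.10, 0.05, -0.40]

guided = classifier_free_guidance(eps_null_prompt, eps_text_prompt, guidance_scale=2.0)
print(guided)
```

&lt;p style=&quot;background-color: #ffffff; color: #1f1f1f; text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;color: #000000;&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;Training on both prompted and null-prompt inputs is what makes both predictions available from a single model at sampling time.&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;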
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;color: #000000;&quot;&gt;&lt;b&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt; Noise Level.&lt;/span&gt;&lt;/b&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;background-color: #ffffff; color: #1f1f1f; text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;color: #000000;&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;The noise level added to the input also affects model performance. A lower noise level &lt;/span&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;tends to yield blurry results that lack detail, whereas &lt;/span&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;too much noise can cause oversharpening. &lt;/span&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;It is therefore important to tune the noise level to balance restoration fidelity against generative capability.&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;hr contenteditable=&quot;false&quot; data-ke-type=&quot;horizontalRule&quot; data-ke-style=&quot;style6&quot; /&gt;
&lt;h2 style=&quot;text-align: justify;&quot; data-end=&quot;1116&quot; data-start=&quot;999&quot; data-ke-size=&quot;size26&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt; 5. Conclusion &lt;/span&gt;&lt;/h2&gt;
&lt;p style=&quot;background-color: #ffffff; color: #1f1f1f; text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;✔️Problem: real-world VSR based on diffusion models remains underexplored. &lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;background-color: #ffffff; color: #1f1f1f; text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;✔️Proposed method: Upscale-A-Video&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;background-color: #ffffff; color: #1f1f1f; text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;✔️Upscale-A-Video: it is designed to exploit an image diffusion prior effectively for real-world VSR while avoiding the temporal discontinuity caused by the inherent randomness of the diffusion process. In particular, introducing a local-global temporal strategy within the LDM effectively strengthens temporal coherence. &lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;background-color: #ffffff; color: #1f1f1f; text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;✔️Additional features: texture generation guided by text prompts and a noise-level control for the restoration-quality trade-off are also introduced, improving practicality and flexibility in real-world settings. &lt;/span&gt;&lt;/p&gt;</description>
      <category>Paper Review/Video Super-Resolution</category>
      <category>Diffusion</category>
      <category>ldm</category>
      <category>upscale-a-video</category>
      <category>VSR</category>
      <author>honey-vision</author>
      <guid isPermaLink="true">https://honey-vision.tistory.com/108</guid>
      <comments>https://honey-vision.tistory.com/108#entry108comment</comments>
      <pubDate>Fri, 20 Jun 2025 18:33:41 +0900</pubDate>
    </item>
    <item>
      <title>[Paper Review] Learning Transferable Visual Models From Natural Language Supervision</title>
      <link>https://honey-vision.tistory.com/95</link>
      <description>&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;This post reviews the CLIP (Contrastive Language&amp;ndash;Image Pretraining) paper.&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;The paper was published by OpenAI in 2021.&lt;/span&gt;&lt;/p&gt;
&lt;hr contenteditable=&quot;false&quot; data-ke-type=&quot;horizontalRule&quot; data-ke-style=&quot;style6&quot; /&gt;
&lt;h2 style=&quot;text-align: justify;&quot; data-ke-size=&quot;size26&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;Abstract&lt;/span&gt;&lt;/h2&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;b&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;Problems with existing computer vision approaches&lt;/span&gt;&lt;/b&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;- They rely on supervised learning, training models to predict a fixed set of predefined object categories. &lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;- Learning any other visual concept requires additional labeled data.&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;- This limits the generality and usability of the models. &lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;b&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;Alternative&lt;/span&gt;&lt;/b&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;Learn directly from raw text about images. &lt;/span&gt;&lt;/p&gt;
&lt;hr contenteditable=&quot;false&quot; data-ke-type=&quot;horizontalRule&quot; data-ke-style=&quot;style6&quot; /&gt;
&lt;h2 style=&quot;text-align: justify;&quot; data-ke-size=&quot;size26&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;1.&amp;nbsp;Introduction&amp;nbsp;and&amp;nbsp;Motivating&amp;nbsp;Work&lt;/span&gt;&lt;/h2&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;b&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;Breakthroughs in NLP&lt;/span&gt;&lt;/b&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;- Pre-training methods learn directly from raw text.&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;- Task-agnostic objectives have been scaled up many times over, steadily improving performance. &lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;- A &quot;text-to-text&quot; input-output interface enables zero-shot transfer to downstream datasets. &lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;- Specialized output heads and dataset-specific customization are no longer needed.&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;e.g., GPT-3&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;These results suggest that the supervision available to modern pre-training methods from web-scale text collections can surpass that of high-quality crowd-labeled NLP datasets.&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000; background-color: #dddddd;&quot;&gt;In computer vision, however, models are still trained on crowd-labeled datasets. &lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000; background-color: #dddddd;&quot;&gt;Could learning directly from text bring a similar breakthrough to computer vision?&amp;nbsp;&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;b&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;Related work&lt;/span&gt;&lt;/b&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;- Improving content-based image retrieval by predicting the nouns and adjectives in text documents paired with images&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;- Manifold learning&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;- Training multimodal Deep Boltzmann Machines on top of low-level image and text tag features&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;- Showing that CNNs trained to predict words in image captions learn useful image representations&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;- Converting the title, description, and hashtag metadata of images in the YFCC100M dataset into a bag-of-words multi-label classification task, and showing that pre-training AlexNet to predict these labels learns representations that perform similarly to ImageNet pre-training on transfer tasks&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;- Predicting n-grams (runs of consecutive words) in addition to individual words, and &lt;/span&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;demonstrating that the resulting system can transfer zero-shot to image classification datasets&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;- VirTex, ICMLM, ConVIRT&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000; background-color: #dddddd;&quot;&gt;Natural Language Supervision is still rarely used. &lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000; background-color: #dddddd;&quot;&gt;-&amp;gt; This is because its demonstrated performance on common benchmarks is much lower than that of alternative approaches.&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000; background-color: #dddddd;&quot;&gt;Instead, more narrowly scoped but well-targeted Weakly&amp;nbsp;Supervised&amp;nbsp;Learning has improved performance. &lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;&lt;b&gt;&lt;span style=&quot;color: #000000; text-align: justify;&quot;&gt;Weakly Supervised Learning&lt;/span&gt; &lt;/b&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;This line of work is a pragmatic middle ground between learning from a limited amount of supervised &quot;gold-labels&quot; and learning from large amounts of raw text. Both Mahajan et al. (2018) and Kolesnikov et al. (2019) carefully designed their supervision signal, but in doing so limited it to 1000 and 18291 classes, respectively. Natural language, through its generality, can express a much wider set of visual concepts and can therefore supervise them. Both approaches also make predictions with static softmax classifiers and lack a mechanism for dynamic outputs (handling new data). &lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;- This limits model flexibility and &lt;/span&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;zero-shot capability.&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;background-color: #dddddd; font-family: 'Noto Serif KR';&quot;&gt;&lt;span style=&quot;color: #000000;&quot;&gt;The key difference between &lt;span style=&quot;color: #000000; text-align: justify;&quot;&gt;Natural Language Supervision&lt;/span&gt; and &lt;span style=&quot;color: #000000; text-align: justify;&quot;&gt;Weakly Supervised Learning&lt;/span&gt; is scale. &lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;background-color: #dddddd; font-family: 'Noto Serif KR';&quot;&gt;&lt;span style=&quot;color: #000000;&quot;&gt;This work closes that gap and studies the behavior of image classifiers trained with natural language supervision at large scale.&amp;nbsp;&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;hr contenteditable=&quot;false&quot; data-ke-type=&quot;horizontalRule&quot; data-ke-style=&quot;style6&quot; /&gt;
&lt;h2 style=&quot;text-align: justify;&quot; data-ke-size=&quot;size26&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;2. Approach&lt;/span&gt;&lt;/h2&gt;
&lt;h3 style=&quot;text-align: justify;&quot; data-ke-size=&quot;size23&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;2.1.&amp;nbsp;Natural&amp;nbsp;Language&amp;nbsp;Supervision&lt;/span&gt;&lt;/h3&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;Zhang et al. (2020), Gomez et al. (2017), Joulin et al. (2016), and Desai &amp;amp; Johnson (2020) all used (text, image) pairs, yet they &lt;b&gt;described their approaches&lt;/b&gt; as unsupervised, self-supervised, weakly supervised, and supervised, respectively.&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;Early work struggled with the complexity of natural language when using topic models and n-gram representations.&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;-&amp;gt; Advances in deep contextual representation learning now provide the tools to make effective use of this vast amount of data. &lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;b&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;Strengths&lt;/span&gt;&lt;/b&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000; background-color: #dddddd;&quot;&gt;- It is easy to scale because no annotation is required. &lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000; background-color: #dddddd;&quot;&gt;- It connects the learned representation to language, which enables flexible zero-shot transfer.&amp;nbsp;&lt;/span&gt;&lt;/p&gt;
&lt;h3 style=&quot;text-align: justify;&quot; data-ke-size=&quot;size23&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;2.2.&amp;nbsp;Creating&amp;nbsp;a&amp;nbsp;Sufficiently&amp;nbsp;Large&amp;nbsp;Dataset&lt;/span&gt;&lt;/h3&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;b&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;Datasets used in prior work&lt;/span&gt;&lt;/b&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;MS-COCO and Visual Genome: high-quality labeled data, but small in scale. &lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000; text-align: justify;&quot;&gt;YFCC100M: the metadata for each image &lt;/span&gt;is sparse and of uneven quality. &lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;b&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;Dataset used in this paper: &lt;span style=&quot;color: #000000; text-align: justify;&quot;&gt;WebImageText (WIT)&lt;/span&gt; &lt;/span&gt;&lt;/b&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;color: #000000;&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;- Data of this kind is publicly available on the internet in large quantities.&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;color: #000000;&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;- The authors built a new dataset of 400 million (image, text) pairs collected from a variety of publicly available sources on the internet. &lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;color: #000000;&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;- To cover as broad a set of visual concepts as possible, the construction process searched for (image, text) pairs whose text includes one of 500,000 queries. &lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;color: #000000;&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;- The dataset is class-balanced by including up to 20,000&lt;span style=&quot;font-family: 'Noto Serif KR'; text-align: justify;&quot;&gt;&amp;nbsp;(image,&amp;nbsp;text) &lt;/span&gt;pairs per query. &lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
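&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;color: #000000;&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;The query-based filtering and per-query cap described above can be sketched roughly as follows. This is only a toy illustration: the function name, the plain substring matching, and the rule that a pair counts toward its first matching query are all assumptions, and the cap is set to 2 instead of 20,000 so the effect is visible on five pairs.&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;

```python
from collections import Counter


def build_balanced_pairs(pairs, queries, max_per_query):
    """Keep (image, text) pairs whose text contains a query, capped per query."""
    kept = []
    per_query_count = Counter()
    for image_id, text in pairs:
        lowered = text.lower()
        for query in queries:
            # Plain substring matching stands in for the real search procedure.
            if query in lowered and per_query_count[query] != max_per_query:
                per_query_count[query] += 1
                kept.append((image_id, text))
                break  # count each pair once, under the first matching query
    return kept, per_query_count


pairs = [
    ("img0", "A dog playing in the park"),
    ("img1", "Dog on a sofa"),
    ("img2", "My dog again"),
    ("img3", "A cat on a mat"),
    ("img4", "Sunset over the sea"),
]
kept, counts = build_balanced_pairs(pairs, queries=["dog", "cat"], max_per_query=2)
print([image_id for image_id, _ in kept])
print(dict(counts))
```

&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;color: #000000;&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;The cap keeps very common queries from dominating the dataset, while pairs that match no query are dropped entirely.&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;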
&lt;h3 style=&quot;text-align: justify;&quot; data-ke-size=&quot;size23&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;2.3.&amp;nbsp;Selecting&amp;nbsp;an&amp;nbsp;Efficient&amp;nbsp;Pre-Training&amp;nbsp;Method&lt;/span&gt;&lt;/h3&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;b&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;Problems with models trained to predict exact labels&lt;/span&gt;&lt;/b&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;The initial approach, similar to VirTex, jointly trained an image CNN and a text transformer from scratch to predict the caption of an image. However, this proved difficult to scale efficiently. Figure 2 shows that a transformer language model, which already uses twice the compute of its ResNet-50 image encoder, learns to recognize ImageNet classes three times slower than a simpler baseline that predicts a bag-of-words encoding of the same text.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;745&quot; data-origin-height=&quot;786&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/bhqXqr/btsLSsqwi6z/PkKHaNiitIyIlo5DXuGcQ1/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/bhqXqr/btsLSsqwi6z/PkKHaNiitIyIlo5DXuGcQ1/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/bhqXqr/btsLSsqwi6z/PkKHaNiitIyIlo5DXuGcQ1/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FbhqXqr%2FbtsLSsqwi6z%2FPkKHaNiitIyIlo5DXuGcQ1%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;365&quot; height=&quot;385&quot; data-origin-width=&quot;745&quot; data-origin-height=&quot;786&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;&lt;span style=&quot;background-color: #dddddd;&quot;&gt;-&amp;gt; The reason is that these approaches try to predict the exact words of the text accompanying each image.&lt;/span&gt; Recent work on contrastive learning for images has found that a contrastive objective can learn better representations. Other work has found that generative models can learn high-quality image representations but require a large amount of compute. &lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;Building on these findings, the authors explored training a system on the comparatively easier task of predicting which text as a whole is paired with which image, rather than the individual words of that text. Starting from the same bag-of-words encoding baseline and swapping the predictive objective for a contrastive objective made zero-shot transfer to ImageNet 4x more efficient.&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;b&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;Training method&lt;/span&gt;&lt;/b&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;- Given a batch of N (image, text) pairs, the model is trained to predict which of the N &amp;times; N possible (image, text) pairings actually match. &lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;- The image encoder and the text encoder are trained jointly.&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;- Maximize the cosine similarity of the image and text embeddings of the N real pairs.&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;- Minimize the cosine similarity of the N&amp;sup2; &amp;ndash; N incorrect pairings.&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;- Optimize a symmetric cross entropy loss over these similarity scores.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;1101&quot; data-origin-height=&quot;520&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/bIQhQ1/btsLS6Oj1jV/QJ5oacXKIbrJL2GW9ITo31/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/bIQhQ1/btsLS6Oj1jV/QJ5oacXKIbrJL2GW9ITo31/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/bIQhQ1/btsLS6Oj1jV/QJ5oacXKIbrJL2GW9ITo31/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FbIQhQ1%2FbtsLS6Oj1jV%2FQJ5oacXKIbrJL2GW9ITo31%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;763&quot; height=&quot;360&quot; data-origin-width=&quot;1101&quot; data-origin-height=&quot;520&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;544&quot; data-origin-height=&quot;575&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/o7gli/btsLTrdCfZe/V2RS1mXotpnXTQxe0UKt10/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/o7gli/btsLTrdCfZe/V2RS1mXotpnXTQxe0UKt10/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/o7gli/btsLTrdCfZe/V2RS1mXotpnXTQxe0UKt10/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2Fo7gli%2FbtsLTrdCfZe%2FV2RS1mXotpnXTQxe0UKt10%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;353&quot; height=&quot;373&quot; data-origin-width=&quot;544&quot; data-origin-height=&quot;575&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
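&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;The contrastive objective in this section can be sketched in plain Python as below. It is a minimal sketch under stated assumptions, not the authors' implementation: the encoders are skipped and the embeddings are passed in directly, the function names are made up for this example, and the temperature is a fixed assumed value of 0.07 instead of a learned parameter.&lt;/span&gt;&lt;/p&gt;

```python
import math


def l2_normalize(vector):
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def cross_entropy(logits_row, target_index):
    """Softmax cross entropy for one row, computed with the log-sum-exp trick."""
    peak = max(logits_row)
    log_sum_exp = peak + math.log(sum(math.exp(v - peak) for v in logits_row))
    return log_sum_exp - logits_row[target_index]


def clip_style_loss(image_embeddings, text_embeddings, temperature=0.07):
    """Symmetric cross entropy over the N x N cosine-similarity matrix."""
    images = [l2_normalize(v) for v in image_embeddings]
    texts = [l2_normalize(v) for v in text_embeddings]
    n = len(images)
    # logits[i][j] = scaled cosine similarity between image i and text j.
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in texts]
        for img in images
    ]
    # Row i should pick text i; column j should pick image j.
    loss_images = sum(cross_entropy(logits[i], i) for i in range(n)) / n
    columns = [list(col) for col in zip(*logits)]
    loss_texts = sum(cross_entropy(columns[j], j) for j in range(n)) / n
    return (loss_images + loss_texts) / 2


# Two toy pairs: correctly matched embeddings versus swapped texts.
matched = clip_style_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
shuffled = clip_style_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(matched, shuffled)
```

&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;The loss is close to zero when each image is most similar to its own text and grows when the texts are swapped, which is the behavior the N real pairs versus N&amp;sup2; &amp;ndash; N incorrect pairings description implies.&lt;/span&gt;&lt;/p&gt;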
&lt;h3 style=&quot;text-align: justify;&quot; data-ke-size=&quot;size23&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;2.4.&amp;nbsp;Choosing&amp;nbsp;and&amp;nbsp;Scaling&amp;nbsp;a&amp;nbsp;Model&lt;/span&gt;&lt;/h3&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;b&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;Image encoder&lt;/span&gt;&lt;/b&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;- A modified ResNet-50 is used.&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;- The global average pooling layer is replaced with an attention pooling mechanism.&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;- &lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000; text-align: justify;&quot;&gt;A Vision Transformer with an additional layer normalization and a modified initialization scheme &lt;/span&gt;is also used.&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;- The models are scaled up.&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;&lt;b&gt;Text encoder&lt;/b&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;- A Transformer model is used.&amp;nbsp;&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;- The model is scaled up.&lt;/span&gt;&lt;/p&gt;
&lt;h3 style=&quot;text-align: justify;&quot; data-ke-size=&quot;size23&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;2.5.&amp;nbsp;Training&lt;/span&gt;&lt;/h3&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;- Trains 5 ResNet models and 3 ViT models&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;- Trains ResNet-50 and ResNet-101&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;- Trains three additional models that use roughly 4x (&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000; text-align: justify;&quot;&gt;RN50x4&lt;/span&gt;), 16x (&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000; text-align: justify;&quot;&gt;RN50x16&lt;/span&gt;), and 64x (&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000; text-align: justify;&quot;&gt;RN50x64&lt;/span&gt;) the compute of a ResNet-50&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;- Trains ViT-B/32, ViT-B/16, and ViT-L/14 (&lt;/span&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;all models are trained for 32 epochs)&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;- Uses the Adam optimizer&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;- Decoupled weight decay regularization is applied to all weights that are not gains or biases&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;- The learning rate is decayed with a cosine schedule&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;- Initial hyperparameters are set with a combination of grid search, random search, and manual tuning on a ResNet-50 trained for 1 epoch&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;- For the larger models, the hyperparameters are adapted heuristically&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;- The learnable temperature parameter tau is initialized to 0.07 and clipped so that the logits are never scaled by more than 100&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;- Uses a minibatch size of 32,768&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;- Mixed precision is used to speed up training and save memory&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;- To save additional memory, gradient checkpointing, half-precision Adam statistics, and half-precision stochastically rounded text encoder weights are used&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;- The embedding similarity computation is also sharded, so each GPU computes only the subset of pairwise similarities needed for its local batch&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;- The largest ResNet model, RN50x64, was trained for 18 days on 592 V100 GPUs&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;- The largest Vision Transformer was trained for 12 days on 256 V100 GPUs&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;- To boost performance, ViT-L/14 is pre-trained for 1 additional epoch at 336-pixel resolution (&lt;/span&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;ViT-L/14@336px)&lt;/span&gt;&lt;/p&gt;
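&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;The temperature setting in the list above can be illustrated with a small sketch. This is a pure-Python illustration under stated assumptions, not the paper's training code: the helper names logit_scale and scaled_logits and the toy 2-D embeddings are made up for this example. It assumes the temperature is stored as a log-scale parameter initialized to log(1 / 0.07), and that the resulting multiplier is capped at 100 before it scales the cosine similarities.&lt;/span&gt;&lt;/p&gt;

```python
import math

INIT_TEMPERATURE = 0.07   # tau at initialization
MAX_LOGIT_SCALE = 100.0   # logits are never scaled by more than this

def logit_scale(log_scale_param):
    """Multiplier applied to cosine similarities, capped at MAX_LOGIT_SCALE."""
    return min(math.exp(log_scale_param), MAX_LOGIT_SCALE)

def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec]

def scaled_logits(image_embs, text_embs, log_scale_param):
    """Pairwise image-text cosine similarities multiplied by the capped scale."""
    scale = logit_scale(log_scale_param)
    images = [l2_normalize(e) for e in image_embs]
    texts = [l2_normalize(e) for e in text_embs]
    return [
        [scale * sum(a * b for a, b in zip(img, txt)) for txt in texts]
        for img in images
    ]

# Assumed parameterization: the learnable value lives in log space.
log_scale_init = math.log(1.0 / INIT_TEMPERATURE)
logits = scaled_logits(
    [[1.0, 0.0], [0.0, 2.0]],   # hypothetical image embeddings
    [[3.0, 0.0], [0.0, 1.0]],   # hypothetical text embeddings
    log_scale_init,
)
print(round(logit_scale(log_scale_init), 4))  # initial multiplier, 1 / 0.07
print(logit_scale(10.0))                      # exp(10) exceeds the cap, so 100.0
```

&lt;p style=&quot;text-align: justify;&quot; data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000;&quot;&gt;With matching image-text pairs on the diagonal, the scaled logits are what a softmax-based contrastive loss would consume; the cap keeps the scale from growing without bound during training.&lt;/span&gt;&lt;/p&gt;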
&lt;hr contenteditable=&quot;false&quot; data-ke-type=&quot;horizontalRule&quot; data-ke-style=&quot;style6&quot; /&gt;
&lt;h2 style=&quot;text-align: justify;&quot; data-ke-size=&quot;size26&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;9.&amp;nbsp;Conclusion&lt;/span&gt;&lt;/h2&gt;
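&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;The authors investigated whether the success of task-agnostic, web-scale pre-training in NLP can be transferred to another domain, and found that similar behavior emerges when the approach is applied to computer vision. &lt;/span&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;To optimize its training objective, CLIP learns to perform a wide variety of tasks during pre-training, and natural language prompts can then be used to apply this learning to downstream tasks. However, there is still a lot of room for improvement.&lt;/span&gt;&lt;/p&gt;</description>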
      <category>Paper Review</category>
      <category>clip</category>
      <category>clip paper review</category>
      <category>contrastive language&amp;ndash;image pretraining</category>
      <category>learning transferable visual models from natural language supervision</category>
      <author>honey-vision</author>
      <guid isPermaLink="true">https://honey-vision.tistory.com/95</guid>
      <comments>https://honey-vision.tistory.com/95#entry95comment</comments>
      <pubDate>Thu, 16 Jan 2025 01:25:23 +0900</pubDate>
    </item>
    <item>
      <title>Running the Vision Transformer Example with Keras</title>
      <link>https://honey-vision.tistory.com/82</link>
      <description>&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;Let's run the Keras implementation of the Vision Transformer and study the code.&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;The official Keras example is available at the link below.&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;&lt;a href=&quot;https://keras.io/examples/vision/image_classification_with_vision_transformer/&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;https://keras.io/examples/vision/image_classification_with_vision_transformer/&lt;/a&gt;&lt;/span&gt;&lt;/p&gt;
&lt;figure id=&quot;og_1729127032736&quot; contenteditable=&quot;false&quot; data-ke-type=&quot;opengraph&quot; data-ke-align=&quot;alignCenter&quot; data-og-type=&quot;website&quot; data-og-title=&quot;Keras documentation: Image classification with Vision Transformer&quot; data-og-description=&quot;► Code examples / Computer Vision / Image classification with Vision Transformer Image classification with Vision Transformer Author: Khalid Salama Date created: 2021/01/18 Last modified: 2021/01/18 Description: Implementing the Vision Transformer (ViT) &quot; data-og-host=&quot;keras.io&quot; data-og-source-url=&quot;https://keras.io/examples/vision/image_classification_with_vision_transformer/&quot; data-og-url=&quot;https://keras.io/examples/vision/image_classification_with_vision_transformer/&quot; data-og-image=&quot;https://scrap.kakaocdn.net/dn/ltO0L/hyXhXE9Ubr/evkKpVWWHvmVN6yXl2SO81/img.png?width=774&amp;amp;height=269&amp;amp;face=0_0_774_269,https://scrap.kakaocdn.net/dn/hx1Bs/hyXhYxjn25/nIDz1VyQInBDC2q5cn1nNk/img.png?width=222&amp;amp;height=222&amp;amp;face=0_0_222_222,https://scrap.kakaocdn.net/dn/K0zya/hyXlPZMOJD/kabFMM2M6d9DkGGkrxL8kk/img.png?width=584&amp;amp;height=456&amp;amp;face=0_0_584_456&quot;&gt;&lt;a href=&quot;https://keras.io/examples/vision/image_classification_with_vision_transformer/&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot; data-source-url=&quot;https://keras.io/examples/vision/image_classification_with_vision_transformer/&quot;&gt;
&lt;div class=&quot;og-image&quot; style=&quot;background-image: url('https://scrap.kakaocdn.net/dn/ltO0L/hyXhXE9Ubr/evkKpVWWHvmVN6yXl2SO81/img.png?width=774&amp;amp;height=269&amp;amp;face=0_0_774_269,https://scrap.kakaocdn.net/dn/hx1Bs/hyXhYxjn25/nIDz1VyQInBDC2q5cn1nNk/img.png?width=222&amp;amp;height=222&amp;amp;face=0_0_222_222,https://scrap.kakaocdn.net/dn/K0zya/hyXlPZMOJD/kabFMM2M6d9DkGGkrxL8kk/img.png?width=584&amp;amp;height=456&amp;amp;face=0_0_584_456');&quot;&gt;&amp;nbsp;&lt;/div&gt;
&lt;div class=&quot;og-text&quot;&gt;
&lt;p class=&quot;og-title&quot; data-ke-size=&quot;size16&quot;&gt;Keras documentation: Image classification with Vision Transformer&lt;/p&gt;
&lt;p class=&quot;og-desc&quot; data-ke-size=&quot;size16&quot;&gt;► Code examples / Computer Vision / Image classification with Vision Transformer Image classification with Vision Transformer Author: Khalid Salama Date created: 2021/01/18 Last modified: 2021/01/18 Description: Implementing the Vision Transformer (ViT)&lt;/p&gt;
&lt;p class=&quot;og-host&quot; data-ke-size=&quot;size16&quot;&gt;keras.io&lt;/p&gt;
&lt;/div&gt;
&lt;/a&gt;&lt;/figure&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;This code is only a simple example meant to help with understanding the paper. &lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;Attention is already available as a library layer here, so implementing it by hand is something to try separately.&lt;/span&gt;&lt;/p&gt;
&lt;hr contenteditable=&quot;false&quot; data-ke-type=&quot;horizontalRule&quot; data-ke-style=&quot;style6&quot; /&gt;
&lt;pre id=&quot;code_1729127865184&quot; class=&quot;python&quot; data-ke-language=&quot;python&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;import os

os.environ[&quot;KERAS_BACKEND&quot;] = &quot;jax&quot;  # @param [&quot;tensorflow&quot;, &quot;jax&quot;, &quot;torch&quot;]

import keras
from keras import layers
from keras import ops

import numpy as np
import matplotlib.pyplot as plt&lt;/code&gt;&lt;/pre&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;Import the required libraries, setting the backend beforehand (KERAS_BACKEND has to be set before keras is imported).&lt;/span&gt;&lt;/p&gt;
&lt;pre id=&quot;code_1729127933342&quot; class=&quot;python&quot; data-ke-language=&quot;python&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;num_classes = 100
input_shape = (32, 32, 3)

(x_train, y_train), (x_test, y_test) = keras.datasets.cifar100.load_data()

print(f&quot;x_train shape: {x_train.shape} - y_train shape: {y_train.shape}&quot;)
print(f&quot;x_test shape: {x_test.shape} - y_test shape: {y_test.shape}&quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;Because the CIFAR-100 dataset is used, num_classes is set to 100.&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;Each image is 32x32 pixels with 3 RGB channels &amp;rarr; (32, 32, 3)&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;The dataset bundled with Keras is loaded as separate image and label arrays.&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;The shapes of the training data (x_train) and labels (y_train) are printed to check them.&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;The shapes of the test data are checked the same way!&lt;/span&gt;&lt;/p&gt;
&lt;pre id=&quot;code_1729129346686&quot; class=&quot;python&quot; data-ke-language=&quot;python&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;learning_rate = 0.001
weight_decay = 0.0001
batch_size = 256
num_epochs = 10  # For real training, use num_epochs=100. 10 is a test value
image_size = 72  # We'll resize input images to this size
patch_size = 6  # Size of the patches to be extracted from the input images
num_patches = (image_size // patch_size) ** 2&lt;/code&gt;&lt;/pre&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;Set the learning rate, weight decay, batch size, and number of epochs.&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;num_patches is the number of patches extracted from one image.&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;Because the patches are extracted without overlap, divide the image side length by the patch size to get the number of patches per side, then square it.&lt;/span&gt;&lt;/p&gt;
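&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;As a quick check of the patch count, here is a minimal sketch using the values from the code above. The helper name count_patches is hypothetical and only wraps the same integer arithmetic.&lt;/span&gt;&lt;/p&gt;

```python
def count_patches(image_size, patch_size):
    """Number of non-overlapping square patches in a square image."""
    patches_per_side = image_size // patch_size  # patches along one edge
    return patches_per_side ** 2

image_size = 72  # resized image side length, as in the settings above
patch_size = 6   # patch side length, as in the settings above
num_patches = count_patches(image_size, patch_size)
print(num_patches)  # 12 patches per side, so 12 * 12 = 144
```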
&lt;pre id=&quot;code_1729130756273&quot; class=&quot;python&quot; data-ke-language=&quot;python&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;projection_dim = 64
num_heads = 4&lt;/code&gt;&lt;/pre&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;Set the dimension of the linear projection and the number of attention heads.&lt;/span&gt;&lt;/p&gt;
&lt;pre id=&quot;code_1729149390764&quot; class=&quot;python&quot; data-ke-language=&quot;python&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;transformer_units = [
    projection_dim * 2,
    projection_dim,
]  # Size of the transformer layers&lt;/code&gt;&lt;/pre&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;Set the number of units (nodes) in each layer of the FFN (feed-forward network).&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;The first layer expands the dimension to projection_dim * 2 = 128 to capture richer information, &lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;and the second layer reduces it back to projection_dim = 64, keeping the important information.&lt;/span&gt;&lt;/p&gt;
&lt;pre id=&quot;code_1729149397094&quot; class=&quot;python&quot; data-ke-language=&quot;python&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;transformer_layers = 8
mlp_head_units = [
    2048,
    1024,
]  # Size of the dense layers of the final classifier&lt;/code&gt;&lt;/pre&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;The number of Transformer layers is set to 8, and mlp_head_units sets the number of units in the MLP (multi-layer perceptron) applied after those layers.&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;Its first dense layer produces 2048 features and the next one produces 1024.&lt;/span&gt;&lt;/p&gt;
&lt;pre id=&quot;code_1729151838188&quot; class=&quot;python&quot; data-ke-language=&quot;python&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;data_augmentation = keras.Sequential(
    [
        layers.Normalization(),
        layers.Resizing(image_size, image_size),
        layers.RandomFlip(&quot;horizontal&quot;),
        layers.RandomRotation(factor=0.02),
        layers.RandomZoom(height_factor=0.2, width_factor=0.2),
    ],
    name=&quot;data_augmentation&quot;,
)
# Compute the mean and the variance of the training data for normalization.
data_augmentation.layers[0].adapt(x_train)&lt;/code&gt;&lt;/pre&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;This block performs data augmentation.&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;It applies normalization, resizing, random horizontal flips, random rotation, and random zoom.&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;data_augmentation.layers[0].adapt(x_train) lets the Normalization layer compute the mean and variance of the training data in advance, so that it is ready to normalize the data it receives later.&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;In other words, the mean and variance are computed from the pixel values of the data, and inputs are then normalized with those statistics.&lt;/span&gt;&lt;/p&gt;
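&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;The idea behind the adapt step can be sketched in plain Python. This is only an illustration for a single channel, not the Keras implementation: the functions adapt and normalize and the toy pixel values are assumptions made for this example.&lt;/span&gt;&lt;/p&gt;

```python
import math

def adapt(values):
    """Compute the statistics a normalization step needs: mean and variance."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, variance

def normalize(values, mean, variance):
    """Shift values to zero mean and scale them to unit variance."""
    std = math.sqrt(variance)
    return [(v - mean) / std for v in values]

# Toy "training data": intensities of one channel (hypothetical values).
train_pixels = [0.0, 64.0, 128.0, 192.0, 255.0]
mean, variance = adapt(train_pixels)       # computed once, ahead of time
normalized = normalize(train_pixels, mean, variance)
print(mean)                                # 127.8
print([round(v, 3) for v in normalized])   # values now centered around 0
```

&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;The statistics are computed once from the training data and then reused for every later input, which is why adapt is called before training starts.&lt;/span&gt;&lt;/p&gt;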
&lt;pre id=&quot;code_1729153165687&quot; class=&quot;python&quot; data-ke-language=&quot;python&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;def mlp(x, hidden_units, dropout_rate):
    for units in hidden_units:
        x = layers.Dense(units, activation=keras.activations.gelu)(x)
        x = layers.Dropout(dropout_rate)(x)
    return x&lt;/code&gt;&lt;/pre&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;Next is the MLP implementation.&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;x: the input data&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;hidden_units: the number of units for each dense layer&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;dropout_rate: the dropout rate&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;The Vision Transformer uses the GELU activation function.&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;Unlike ReLU, which cuts off sharply (negative inputs become 0 and positive inputs pass through unchanged), GELU weights its input smoothly.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;235&quot; data-origin-height=&quot;49&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/cVwk0S/btsKbfLlqne/6pjJnsfrYKrtfEBKdMxJbk/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/cVwk0S/btsKbfLlqne/6pjJnsfrYKrtfEBKdMxJbk/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/cVwk0S/btsKbfLlqne/6pjJnsfrYKrtfEBKdMxJbk/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FcVwk0S%2FbtsKbfLlqne%2F6pjJnsfrYKrtfEBKdMxJbk%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;235&quot; height=&quot;49&quot; data-origin-width=&quot;235&quot; data-origin-height=&quot;49&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;220&quot; data-origin-height=&quot;183&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/cD9wgW/btsKaoCx6c9/MnogdFwm7LdaUdW1IWkwF0/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/cD9wgW/btsKaoCx6c9/MnogdFwm7LdaUdW1IWkwF0/img.png&quot; data-alt=&quot;Source: Wikipedia&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/cD9wgW/btsKaoCx6c9/MnogdFwm7LdaUdW1IWkwF0/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FcD9wgW%2FbtsKaoCx6c9%2FMnogdFwm7LdaUdW1IWkwF0%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;220&quot; height=&quot;183&quot; data-origin-width=&quot;220&quot; data-origin-height=&quot;183&quot;/&gt;&lt;/span&gt;&lt;figcaption&gt;Source: Wikipedia&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;/p&gt;
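&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;To make the difference between the two activations concrete, here is a small standard-library sketch. It assumes the exact GELU definition, x times the standard normal CDF, written with math.erf; frameworks may also offer a tanh-based approximation, which is not shown here.&lt;/span&gt;&lt;/p&gt;

```python
import math

def relu(x):
    """Hard gate: negative inputs become 0, positive inputs pass through."""
    return max(0.0, x)

def gelu(x):
    """Exact GELU: x multiplied by the standard normal CDF evaluated at x."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

# Compare both activations on a few sample inputs.
for x in (-3.0, -1.0, 0.0, 1.0, 3.0):
    print(f"x={x:+.1f}  relu={relu(x):.4f}  gelu={gelu(x):.4f}")
```

&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;For small negative inputs GELU returns a small negative value instead of exactly 0, and for large positive inputs it approaches the input itself, which is the smooth behavior described above.&lt;/span&gt;&lt;/p&gt;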
&lt;pre id=&quot;code_1729153394817&quot; class=&quot;python&quot; data-ke-language=&quot;python&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;class Patches(layers.Layer):
    def __init__(self, patch_size):
        super().__init__()
        self.patch_size = patch_size

    def call(self, images):
        input_shape = ops.shape(images)
        batch_size = input_shape[0]
        height = input_shape[1]
        width = input_shape[2]
        channels = input_shape[3]
        num_patches_h = height // self.patch_size
        num_patches_w = width // self.patch_size
        patches = keras.ops.image.extract_patches(images, size=self.patch_size)
        patches = ops.reshape(
            patches,
            (
                batch_size,
                num_patches_h * num_patches_w,
                self.patch_size * self.patch_size * channels,
            ),
        )
        return patches

    def get_config(self):
        config = super().get_config()
        config.update({&quot;patch_size&quot;: self.patch_size})
        return config&lt;/code&gt;&lt;/pre&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;Now we implement the layer that extracts patches from the input image.&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;It splits the image into patches of the configured patch size and flattens each patch.&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;The input images arrive in the form [batch_size, height, width, channels].&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt; num_patches_h = height // self.patch_size &lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;num_patches_w = width // self.patch_size &lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;These two lines compute how many patches are produced along the height and the width when the image is divided by the patch size.&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;For example, if the image size is 4x4 and the patch size is set to 2x2, &lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;image height / patch size = 4 / 2 = 2&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;image width / patch size = 4 / 2 = 2&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;2 * 2 = 4, so there are 4 patches in total.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;695&quot; data-origin-height=&quot;404&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/dO0wEn/btsKbCsLz8T/B9n0YKwQYEWNM42sFhqVvk/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/dO0wEn/btsKbCsLz8T/B9n0YKwQYEWNM42sFhqVvk/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/dO0wEn/btsKbCsLz8T/B9n0YKwQYEWNM42sFhqVvk/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FdO0wEn%2FbtsKbCsLz8T%2FB9n0YKwQYEWNM42sFhqVvk%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;502&quot; height=&quot;292&quot; data-origin-width=&quot;695&quot; data-origin-height=&quot;404&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;722&quot; data-origin-height=&quot;393&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/boxlHX/btsJ9lNf6WV/0AmnMqkBu87EoAajgKrRkk/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/boxlHX/btsJ9lNf6WV/0AmnMqkBu87EoAajgKrRkk/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/boxlHX/btsJ9lNf6WV/0AmnMqkBu87EoAajgKrRkk/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FboxlHX%2FbtsJ9lNf6WV%2F0AmnMqkBu87EoAajgKrRkk%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;513&quot; height=&quot;279&quot; data-origin-width=&quot;722&quot; data-origin-height=&quot;393&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;723&quot; data-origin-height=&quot;380&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/ba8zkT/btsKar67woR/7YGCf4GYXjA41uBu5lepek/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/ba8zkT/btsKar67woR/7YGCf4GYXjA41uBu5lepek/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/ba8zkT/btsKar67woR/7YGCf4GYXjA41uBu5lepek/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2Fba8zkT%2FbtsKar67woR%2F7YGCf4GYXjA41uBu5lepek%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;499&quot; height=&quot;262&quot; data-origin-width=&quot;723&quot; data-origin-height=&quot;380&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;The image.extract_patches function splits the image into patches.&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;After the split, each patch is flattened, giving the shape below.&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;(batch_size, num_patches, flattened_patch_size)&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;Leaving the batch size aside for a moment, let's call the (&lt;span style=&quot;color: #333333; text-align: start;&quot;&gt;num_patches, flattened_patch_size) form the input matrix.&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;&lt;span style=&quot;color: #333333; text-align: start;&quot;&gt;Up to this point, the code splits the image into patches and flattens them.&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt; &lt;span style=&quot;color: #333333; text-align: start;&quot;&gt;flattened_patch_size&lt;/span&gt;&lt;span style=&quot;color: #333333; text-align: start;&quot;&gt;&amp;nbsp;= 2x2x3 = 12&lt;/span&gt; &lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;&lt;span style=&quot;color: #333333; text-align: start;&quot;&gt;For the example image (4 patches of 2x2 pixels with 3 channels), the input matrix is (4, 12).&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
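&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;The 4x4 example can be reproduced with a short pure-Python sketch. The function extract_patches below is a hypothetical stand-in written with nested lists, not the Keras op, and the order of values inside each flattened patch (row, then column, then channel) is an assumption of this sketch.&lt;/span&gt;&lt;/p&gt;

```python
def extract_patches(image, patch_size):
    """Split an H x W x C nested-list image into flattened, non-overlapping patches."""
    height, width = len(image), len(image[0])
    patches = []
    for top in range(0, height - patch_size + 1, patch_size):
        for left in range(0, width - patch_size + 1, patch_size):
            flat = []
            for row in range(top, top + patch_size):
                for col in range(left, left + patch_size):
                    flat.extend(image[row][col])  # append every channel value
            patches.append(flat)
    return patches

# Hypothetical 4 x 4 image with 3 channels; each pixel stores its own (row, col).
image = [[[row, col, 0] for col in range(4)] for row in range(4)]
patches = extract_patches(image, patch_size=2)
print(len(patches), len(patches[0]))  # 4 patches, each with 2 * 2 * 3 = 12 values
```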
&lt;pre id=&quot;code_1729153416358&quot; class=&quot;python&quot; data-ke-language=&quot;python&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;plt.figure(figsize=(4, 4))
image = x_train[np.random.choice(range(x_train.shape[0]))]
plt.imshow(image.astype(&quot;uint8&quot;))
plt.axis(&quot;off&quot;)

resized_image = ops.image.resize(
    ops.convert_to_tensor([image]), size=(image_size, image_size)
)
patches = Patches(patch_size)(resized_image)
print(f&quot;Image size: {image_size} X {image_size}&quot;)
print(f&quot;Patch size: {patch_size} X {patch_size}&quot;)
print(f&quot;Patches per image: {patches.shape[1]}&quot;)
print(f&quot;Elements per patch: {patches.shape[-1]}&quot;)

n = int(np.sqrt(patches.shape[1]))
plt.figure(figsize=(4, 4))
for i, patch in enumerate(patches[0]):
    ax = plt.subplot(n, n, i + 1)
    patch_img = ops.reshape(patch, (patch_size, patch_size, 3))
    plt.imshow(ops.convert_to_numpy(patch_img).astype(&quot;uint8&quot;))
    plt.axis(&quot;off&quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;A random image is picked from x_train and shown in a 4x4-inch figure, then resized and split into patches of the patch size set earlier.&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;The image size, patch size, number of patches per image, and number of elements per patch are printed for checking.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;340&quot; data-origin-height=&quot;108&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/bhVqDu/btsJ93ZG7e0/8p06arnYtgLJUI5QGzApxk/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/bhVqDu/btsJ93ZG7e0/8p06arnYtgLJUI5QGzApxk/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/bhVqDu/btsJ93ZG7e0/8p06arnYtgLJUI5QGzApxk/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FbhVqDu%2FbtsJ93ZG7e0%2F8p06arnYtgLJUI5QGzApxk%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;280&quot; height=&quot;89&quot; data-origin-width=&quot;340&quot; data-origin-height=&quot;108&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;An n x n grid is created to match the total number of patches, and each flattened patch is reshaped back to its original (patch_size, patch_size, 3) form and displayed individually.&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;The patches are taken from the image after it has been resized to the size used for training.&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;The images below are the output of running the code.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;327&quot; data-origin-height=&quot;324&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/Hfx5F/btsKbzpmOaN/QXV0uM9PD5WpGwuIPTkcdK/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/Hfx5F/btsKbzpmOaN/QXV0uM9PD5WpGwuIPTkcdK/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/Hfx5F/btsKbzpmOaN/QXV0uM9PD5WpGwuIPTkcdK/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FHfx5F%2FbtsKbzpmOaN%2FQXV0uM9PD5WpGwuIPTkcdK%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;327&quot; height=&quot;324&quot; data-origin-width=&quot;327&quot; data-origin-height=&quot;324&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;314&quot; data-origin-height=&quot;304&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/OqPmi/btsKbAIvBkz/B9ikXkocQyfkMIYvhhcoo0/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/OqPmi/btsKbAIvBkz/B9ikXkocQyfkMIYvhhcoo0/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/OqPmi/btsKbAIvBkz/B9ikXkocQyfkMIYvhhcoo0/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FOqPmi%2FbtsKbAIvBkz%2FB9ikXkocQyfkMIYvhhcoo0%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;314&quot; height=&quot;304&quot; data-origin-width=&quot;314&quot; data-origin-height=&quot;304&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
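&lt;pre id=&quot;code_1729153432296&quot; class=&quot;python&quot; data-ke-language=&quot;python&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;class PatchEncoder(layers.Layer):
    def __init__(self, num_patches, projection_dim, **kwargs):
        # Forward name/dtype/etc. so the layer can be rebuilt from its config.
        super().__init__(**kwargs)
        self.num_patches = num_patches
        self.projection_dim = projection_dim
        # Linear projection of each flattened patch to projection_dim.
        self.projection = layers.Dense(units=projection_dim)
        # Learnable embedding: one vector per patch position.
        self.position_embedding = layers.Embedding(
            input_dim=num_patches, output_dim=projection_dim
        )

    def call(self, patch):
        # Position indices 0..num_patches-1 with a leading batch axis.
        positions = ops.expand_dims(
            ops.arange(start=0, stop=self.num_patches, step=1), axis=0
        )
        projected_patches = self.projection(patch)
        # Broadcast-add the position embeddings to every sample in the batch.
        encoded = projected_patches + self.position_embedding(positions)
        return encoded

    def get_config(self):
        # Include every constructor argument so from_config can rebuild the layer.
        config = super().get_config()
        config.update(
            {
                &quot;num_patches&quot;: self.num_patches,
                &quot;projection_dim&quot;: self.projection_dim,
            }
        )
        return config&lt;/code&gt;&lt;/pre&gt;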
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;Everything up to this point prepared the input so that it can be fed to the encoder.&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;This part implements the layer that takes the patches, applies a linear projection, and adds the positional embedding.&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;The position embedding layer turns each patch's position index into an embedding vector, which is added to the projected patch so that every patch carries its position information.&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;positions holds the position indices, one per patch.&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;get_config returns the layer's configuration values so that they are stored when the model is saved and are available when it is loaded again later.&lt;/span&gt;&lt;/p&gt;
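&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;To make the shapes concrete, here is a minimal NumPy sketch of the same projection plus position-embedding step. It is not the Keras layer itself: the toy sizes and the random arrays standing in for the learned Dense and Embedding weights are assumptions for illustration, and the Dense bias is left out.&lt;/span&gt;&lt;/p&gt;

```python
import numpy as np

# Hypothetical toy sizes, chosen only for illustration.
num_patches, patch_dim, projection_dim = 4, 12, 8
rng = np.random.default_rng(0)

# Random stand-ins for the learned Dense kernel and Embedding table.
projection_w = rng.normal(size=(patch_dim, projection_dim))
position_table = rng.normal(size=(num_patches, projection_dim))

# A batch of 2 images, each already split into flattened patches.
patches = rng.normal(size=(2, num_patches, patch_dim))

# Same idea as ops.expand_dims(ops.arange(...), axis=0): shape (1, num_patches).
positions = np.expand_dims(np.arange(num_patches), axis=0)

projected = patches @ projection_w             # (2, num_patches, projection_dim)
position_vectors = position_table[positions]   # (1, num_patches, projection_dim)
encoded = projected + position_vectors         # broadcast over the batch axis

print(positions)      # [[0 1 2 3]]
print(encoded.shape)  # (2, 4, 8)
```

&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;The leading axis of size 1 in positions is what lets the same position vectors be added to every image in the batch.&lt;/span&gt;&lt;/p&gt;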
&lt;pre id=&quot;code_1729153451893&quot; class=&quot;python&quot; data-ke-language=&quot;python&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;def create_vit_classifier():
    inputs = keras.Input(shape=input_shape)
    # Augment data.
    augmented = data_augmentation(inputs)
    # Create patches.
    patches = Patches(patch_size)(augmented)
    # Encode patches.
    encoded_patches = PatchEncoder(num_patches, projection_dim)(patches)

    # Create multiple layers of the Transformer block.
    for _ in range(transformer_layers):
        # Layer normalization 1.
        x1 = layers.LayerNormalization(epsilon=1e-6)(encoded_patches)
        # Create a multi-head attention layer.
        attention_output = layers.MultiHeadAttention(
            num_heads=num_heads, key_dim=projection_dim, dropout=0.1
        )(x1, x1)
        # Skip connection 1.
        x2 = layers.Add()([attention_output, encoded_patches])
        # Layer normalization 2.
        x3 = layers.LayerNormalization(epsilon=1e-6)(x2)
        # MLP.
        x3 = mlp(x3, hidden_units=transformer_units, dropout_rate=0.1)
        # Skip connection 2.
        encoded_patches = layers.Add()([x3, x2])

    # Create a [batch_size, projection_dim] tensor.
    representation = layers.LayerNormalization(epsilon=1e-6)(encoded_patches)
    representation = layers.Flatten()(representation)
    representation = layers.Dropout(0.5)(representation)
    # Add MLP.
    features = mlp(representation, hidden_units=mlp_head_units, dropout_rate=0.5)
    # Classify outputs.
    logits = layers.Dense(num_classes)(features)
    # Create the Keras model.
    model = keras.Model(inputs=inputs, outputs=logits)
    return model&lt;/code&gt;&lt;/pre&gt;
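&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;This function builds the ViT model. It repeats the Transformer block transformer_layers times to learn the relationships between patches, and finally performs classification.&lt;/span&gt;&lt;/p&gt;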
&lt;pre id=&quot;code_1729153465071&quot; class=&quot;python&quot; data-ke-language=&quot;python&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;def run_experiment(model):
    optimizer = keras.optimizers.AdamW(
        learning_rate=learning_rate, weight_decay=weight_decay
    )

    model.compile(
        optimizer=optimizer,
        loss=keras.losses.SparseCategoricalCrossentropy(from_logits=True),
        metrics=[
            keras.metrics.SparseCategoricalAccuracy(name=&quot;accuracy&quot;),
            keras.metrics.SparseTopKCategoricalAccuracy(5, name=&quot;top-5-accuracy&quot;),
        ],
    )

    checkpoint_filepath = &quot;/tmp/checkpoint.weights.h5&quot;
    checkpoint_callback = keras.callbacks.ModelCheckpoint(
        checkpoint_filepath,
        monitor=&quot;val_accuracy&quot;,
        save_best_only=True,
        save_weights_only=True,
    )

    history = model.fit(
        x=x_train,
        y=y_train,
        batch_size=batch_size,
        epochs=num_epochs,
        validation_split=0.1,
        callbacks=[checkpoint_callback],
    )

    model.load_weights(checkpoint_filepath)
    _, accuracy, top_5_accuracy = model.evaluate(x_test, y_test)
    print(f&quot;Test accuracy: {round(accuracy * 100, 2)}%&quot;)
    print(f&quot;Test top 5 accuracy: {round(top_5_accuracy * 100, 2)}%&quot;)

    return history


vit_classifier = create_vit_classifier()
history = run_experiment(vit_classifier)


def plot_history(item):
    plt.plot(history.history[item], label=item)
    plt.plot(history.history[&quot;val_&quot; + item], label=&quot;val_&quot; + item)
    plt.xlabel(&quot;Epochs&quot;)
    plt.ylabel(item)
    plt.title(&quot;Train and Validation {} Over Epochs&quot;.format(item), fontsize=14)
    plt.legend()
    plt.grid()
    plt.show()


plot_history(&quot;loss&quot;)
plot_history(&quot;top-5-accuracy&quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;Finally, this is the code that trains the model and visualizes the loss and top-5 accuracy.&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;The ViT paper used Adam as the optimizer, but AdamW is used here.&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;The graph below is the result of running the code.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;579&quot; data-origin-height=&quot;450&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/B30vB/btsKbgcq37y/itTFImGfuleACz1eLHihw1/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/B30vB/btsKbgcq37y/itTFImGfuleACz1eLHihw1/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/B30vB/btsKbgcq37y/itTFImGfuleACz1eLHihw1/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FB30vB%2FbtsKbgcq37y%2FitTFImGfuleACz1eLHihw1%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;579&quot; height=&quot;450&quot; data-origin-width=&quot;579&quot; data-origin-height=&quot;450&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;But the loss graph.. where did it go? It went missing, so I need to check this again.&lt;/span&gt;&lt;/p&gt;
&lt;figure contenteditable=&quot;false&quot; data-ke-type=&quot;emoticon&quot; data-ke-align=&quot;alignCenter&quot; data-emoticon-type=&quot;friends2&quot; data-emoticon-name=&quot;014&quot; data-emoticon-isanimation=&quot;false&quot; data-emoticon-src=&quot;https://t1.daumcdn.net/keditor/emoticon/friends2/large/014.png&quot;&gt;&lt;img src=&quot;https://t1.daumcdn.net/keditor/emoticon/friends2/large/014.png&quot; width=&quot;150&quot; /&gt;&lt;/figure&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;</description>
      <category>Paper Review</category>
      <category>keras</category>
      <category>Vision Transformer</category>
      <category>VIT</category>
      <author>honey-vision</author>
      <guid isPermaLink="true">https://honey-vision.tistory.com/82</guid>
      <comments>https://honey-vision.tistory.com/82#entry82comment</comments>
      <pubDate>Thu, 17 Oct 2024 17:24:46 +0900</pubDate>
    </item>
    <item>
      <title>[Paper Review] Self-training with Noisy Student improves ImageNet classification</title>
      <link>https://honey-vision.tistory.com/77</link>
      <description>&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;This post reviews a paper on image classification that makes use of unlabeled data.&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;Unlabeled data is just raw image data: images that have not been sorted into any folder or class.&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;After the review, I plan to write a post about the implementation code.&lt;/span&gt;&lt;/p&gt;
&lt;hr contenteditable=&quot;false&quot; data-ke-type=&quot;horizontalRule&quot; data-ke-style=&quot;style6&quot; /&gt;
&lt;h2 data-ke-size=&quot;size26&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;Abstract&lt;/span&gt;&lt;/h2&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;- Proposes Noisy Student Training, a semi-supervised learning approach&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;- Extends self-training and distillation by training the student model with added noise and by using a student model that is equal to or larger than the teacher model&lt;/span&gt;&lt;/p&gt;
&lt;hr contenteditable=&quot;false&quot; data-ke-type=&quot;horizontalRule&quot; data-ke-style=&quot;style6&quot; /&gt;
&lt;h2 data-ke-size=&quot;size26&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;1. Introduction&lt;/span&gt;&lt;/h2&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;- Vision models are still mostly trained on labeled images rather than making use of unlabeled ones&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;- By training with unlabeled images in three main steps, the paper improves both accuracy and robustness&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;- Noisy Student Training improves on self-training and distillation in two ways&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;- On the ImageNet dataset, Noisy Student Training beats the previous SOTA model by 2% and also improves robustness (see Table 1)&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;1153&quot; data-origin-height=&quot;463&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/It915/btsJCjHEUcx/sKmGFbgRvpIcoZBFJUk811/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/It915/btsJCjHEUcx/sKmGFbgRvpIcoZBFJUk811/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/It915/btsJCjHEUcx/sKmGFbgRvpIcoZBFJUk811/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FIt915%2FbtsJCjHEUcx%2FsKmGFbgRvpIcoZBFJUk811%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;431&quot; height=&quot;173&quot; data-origin-width=&quot;1153&quot; data-origin-height=&quot;463&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;hr contenteditable=&quot;false&quot; data-ke-type=&quot;horizontalRule&quot; data-ke-style=&quot;style6&quot; /&gt;
&lt;h2 data-ke-size=&quot;size26&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;2. Noisy Student Training (Algorithm 1)&lt;/span&gt;&lt;/h2&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;- Noisy Student Training is a semi-supervised learning method that improves on self-training and distillation&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;- The key ideas are adding noise to the student model and using a student model that is equal to or larger than the teacher model&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;547&quot; data-origin-height=&quot;558&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/cHWtqZ/btsJAHKbxSn/K31Qk68OAKZbzu9FtaWz8k/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/cHWtqZ/btsJAHKbxSn/K31Qk68OAKZbzu9FtaWz8k/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/cHWtqZ/btsJAHKbxSn/K31Qk68OAKZbzu9FtaWz8k/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FcHWtqZ%2FbtsJAHKbxSn%2FK31Qk68OAKZbzu9FtaWz8k%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;458&quot; height=&quot;467&quot; data-origin-width=&quot;547&quot; data-origin-height=&quot;558&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;759&quot; data-origin-height=&quot;527&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/cxGUnz/btsJAHDpNnb/owrNDGFF8OV9YCl0yc3y50/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/cxGUnz/btsJAHDpNnb/owrNDGFF8OV9YCl0yc3y50/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/cxGUnz/btsJAHDpNnb/owrNDGFF8OV9YCl0yc3y50/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FcxGUnz%2FbtsJAHDpNnb%2FowrNDGFF8OV9YCl0yc3y50%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;531&quot; height=&quot;369&quot; data-origin-width=&quot;759&quot; data-origin-height=&quot;527&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
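&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;The teacher/student loop can be sketched on toy data as follows. This is only an illustration under heavy assumptions, not the paper's setup: the model is a hypothetical centroid classifier on 2-D Gaussian blobs instead of an EfficientNet, the student is the same size as the teacher, and Gaussian input noise stands in for RandAugment, dropout and stochastic depth.&lt;/span&gt;&lt;/p&gt;

```python
import numpy as np

rng = np.random.default_rng(0)

def make_blobs(n_per_class):
    """Two Gaussian blobs in 2-D; returns features and integer labels."""
    x0 = rng.normal(loc=(-2.0, 0.0), size=(n_per_class, 2))
    x1 = rng.normal(loc=(2.0, 0.0), size=(n_per_class, 2))
    x = np.concatenate([x0, x1])
    y = np.concatenate([np.zeros(n_per_class, int), np.ones(n_per_class, int)])
    return x, y

def fit_centroids(x, soft_labels):
    """'Train' a toy model: one centroid per class, weighted by soft labels."""
    weights = soft_labels / soft_labels.sum(axis=0, keepdims=True)
    return weights.T @ x  # shape (num_classes, num_features)

def predict_proba(centroids, x):
    """Softmax over negative squared distances to each centroid."""
    d2 = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
    logits = -d2
    logits -= logits.max(axis=1, keepdims=True)
    p = np.exp(logits)
    return p / p.sum(axis=1, keepdims=True)

x_lab, y_lab = make_blobs(10)    # small labeled set
x_unlab, _ = make_blobs(500)     # labels are ignored for the unlabeled set
x_test, y_test = make_blobs(200)
one_hot = np.eye(2)[y_lab]

# Step 1: train the teacher on labeled data only.
teacher = fit_centroids(x_lab, one_hot)

for iteration in range(3):
    # Step 2: the un-noised teacher produces soft pseudo labels.
    pseudo = predict_proba(teacher, x_unlab)
    # Step 3: train the student on labeled + pseudo-labeled data,
    # with input noise applied only to the student's unlabeled inputs.
    x_noisy = x_unlab + rng.normal(scale=0.3, size=x_unlab.shape)
    student = fit_centroids(
        np.concatenate([x_lab, x_noisy]),
        np.concatenate([one_hot, pseudo]),
    )
    # Step 4: the student becomes the teacher for the next iteration.
    teacher = student

accuracy = (predict_proba(teacher, x_test).argmax(axis=1) == y_test).mean()
print(f"test accuracy after 3 iterations: {accuracy:.3f}")
```

&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;The point of the sketch is the control flow: pseudo labels always come from a clean teacher, noise is applied only on the student side, and the student replaces the teacher at the end of each round.&lt;/span&gt;&lt;/p&gt;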
&lt;h4 data-ke-size=&quot;size20&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;Noising Student&lt;/span&gt;&lt;/h4&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;- The teacher model has to generate the pseudo labels, so it is left un-noised; input noise and model noise are added only to the student model&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;- Applying noise to the unlabeled images makes the decision function consistent: the model predicts the same class even when the data is transformed. Model noise in the form of dropout and stochastic depth produces an ensemble-like effect&lt;/span&gt;&lt;/p&gt;
&lt;h4 data-ke-size=&quot;size20&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;Other Techniques &lt;/span&gt;&lt;/h4&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;- Noisy Student Training works better when combined with additional techniques such as data filtering and balancing&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;- Either soft or hard pseudo labels can be used; on out-of-domain unlabeled data, soft pseudo labels worked slightly better&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000; background-color: #c0d1e7;&quot;&gt;The difference between the soft labels and hard labels discussed in the paper is shown in the figure below.&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #000000; background-color: #c0d1e7;&quot;&gt;A soft label is the predicted probability for each class, whereas a hard label is a one-hot prediction in which each class is either 0 or 1.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;928&quot; data-origin-height=&quot;198&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/7NMYp/btsJBopIzwW/DhdzlVDnSUpAhPH0b2jea0/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/7NMYp/btsJBopIzwW/DhdzlVDnSUpAhPH0b2jea0/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/7NMYp/btsJBopIzwW/DhdzlVDnSUpAhPH0b2jea0/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2F7NMYp%2FbtsJBopIzwW%2FDhdzlVDnSUpAhPH0b2jea0%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;583&quot; height=&quot;124&quot; data-origin-width=&quot;928&quot; data-origin-height=&quot;198&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
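&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;As a small sketch of the soft versus hard distinction above, the snippet below converts teacher logits into both kinds of pseudo label. The logit values are made up for illustration and do not come from the paper.&lt;/span&gt;&lt;/p&gt;

```python
import numpy as np

# Hypothetical teacher logits for 2 unlabeled images over 3 classes.
teacher_logits = np.array([[2.0, 1.0, 0.1],
                           [0.2, 0.1, 3.0]])

# Soft pseudo labels: the full softmax distribution over classes.
shifted = teacher_logits - teacher_logits.max(axis=1, keepdims=True)
soft_labels = np.exp(shifted) / np.exp(shifted).sum(axis=1, keepdims=True)

# Hard pseudo labels: a one-hot vector for the most likely class.
hard_labels = np.eye(teacher_logits.shape[1])[soft_labels.argmax(axis=1)]

print(np.round(soft_labels, 3))
print(hard_labels)
```

&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;The soft labels keep how confident the teacher was about every class, while the hard labels keep only the winning class.&lt;/span&gt;&lt;/p&gt;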
&lt;h4 data-ke-size=&quot;size20&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;Comparisons with Existing SSL Methods&lt;/span&gt;&lt;/h4&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;- Compared with Noisy Student Training, existing SSL (semi-supervised learning) methods have no separate teacher model, so they cannot train consistently on the pseudo labels&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;- Even where workarounds exist, they are hard to use on large-scale datasets such as ImageNet&lt;/span&gt;&lt;/p&gt;
&lt;hr contenteditable=&quot;false&quot; data-ke-type=&quot;horizontalRule&quot; data-ke-style=&quot;style6&quot; /&gt;
&lt;h2 data-ke-size=&quot;size26&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;3. Experiments &lt;/span&gt;&lt;/h2&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;This section compares the method against SOTA models and shows striking results on the robustness datasets&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #333333; text-align: start; background-color: #c0d1e7;&quot;&gt;A robustness dataset, as the name suggests, measures how robust a model is.&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #333333; text-align: start; background-color: #c0d1e7;&quot;&gt;In other words, how strong is the model? It checks whether the model still predicts correctly on difficult data.&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #333333; text-align: start; background-color: #c0d1e7;&quot;&gt;Datasets for testing robustness use images with effects applied, such as rotated or blurred images.&lt;/span&gt;&lt;/p&gt;
&lt;h3 data-ke-size=&quot;size23&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;3.1. Experiment Details&lt;/span&gt;&lt;/h3&gt;
&lt;h4 data-ke-size=&quot;size20&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;Labeled dataset &lt;/span&gt;&lt;/h4&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;The ImageNet dataset is used&lt;/span&gt;&lt;/p&gt;
&lt;h4 data-ke-size=&quot;size20&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;Unlabeled dataset &lt;/span&gt;&lt;/h4&gt;
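&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;- 300M unlabeled images were taken from the JFT dataset; where images do have labels, the labels were ignored&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;- Data filtering and balancing are applied&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;- For a fair comparison, the public YFCC100M dataset is also used (results in Appendix A.4)&lt;/span&gt;&lt;/p&gt;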
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;555&quot; data-origin-height=&quot;726&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/ujOCp/btsJA9TUl8t/vTkkkAnqAuVVrJ6wAQImY0/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/ujOCp/btsJA9TUl8t/vTkkkAnqAuVVrJ6wAQImY0/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/ujOCp/btsJA9TUl8t/vTkkkAnqAuVVrJ6wAQImY0/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FujOCp%2FbtsJA9TUl8t%2FvTkkkAnqAuVVrJ6wAQImY0%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;464&quot; height=&quot;607&quot; data-origin-width=&quot;555&quot; data-origin-height=&quot;726&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; color: #333333; text-align: start; background-color: #c0d1e7;&quot;&gt;JFT is a dataset built internally by Google, consisting of roughly 300 million images.&lt;/span&gt;&lt;/p&gt;
&lt;h4 data-ke-size=&quot;size20&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;Architecture &lt;/span&gt;&lt;/h4&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;EfficientNets are used as the baseline models because they provide better capacity for more data&lt;/span&gt;&lt;/p&gt;
&lt;h4 data-ke-size=&quot;size20&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;Training details&lt;/span&gt;&lt;/h4&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;- For labeled images, a batch size of 2048 is used; it can be reduced when the model does not fit in memory&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;- For unlabeled images, a larger batch size is used. The model is first trained at a smaller resolution for 350 epochs, then fine-tuned at a larger resolution on labeled images for 1.5 epochs&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;- The unlabeled batch size is 14 times the labeled batch size&lt;/span&gt;&lt;/p&gt;
&lt;h4 data-ke-size=&quot;size20&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;Noise &lt;/span&gt;&lt;/h4&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;- Stochastic depth, dropout (0.5), and RandAugment are applied to the student. For stochastic depth, the survival probability is set to 0.8 for the final layer and the other layers follow the linear decay rule. RandAugment applies two random operations&lt;/span&gt;&lt;/p&gt;
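&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;A minimal sketch of the linear decay rule for stochastic depth follows. It assumes the usual formulation p_l = 1 - (l / L) * (1 - p_L) for layer l out of L; the function name and the 4-layer example are hypothetical, and the exact per-block indexing in an actual EfficientNet implementation may differ.&lt;/span&gt;&lt;/p&gt;

```python
def survival_probabilities(num_layers, final_survival_prob=0.8):
    """Linear decay rule: p_l = 1 - (l / L) * (1 - p_L) for l = 1..L."""
    return [
        1.0 - (layer / num_layers) * (1.0 - final_survival_prob)
        for layer in range(1, num_layers + 1)
    ]

# Toy example with 4 layers and a final survival probability of 0.8.
probs = survival_probabilities(num_layers=4)
print([round(p, 2) for p in probs])  # [0.95, 0.9, 0.85, 0.8]
```

&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;Early layers are almost always kept, and the chance of a layer being skipped grows linearly toward the final layer.&lt;/span&gt;&lt;/p&gt;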
&lt;h4 data-ke-size=&quot;size20&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;Iterative training&lt;/span&gt;&lt;/h4&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;- The best model was obtained after three iterations of training&lt;/span&gt;&lt;/p&gt;
&lt;h3 data-ke-size=&quot;size23&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;3.2. ImageNet Results&lt;/span&gt;&lt;/h3&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;- EfficientNet-L2 trained with Noisy Student Training achieves 88.4% top-1 accuracy&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;- The best previous EfficientNet accuracy was 85.0%; the 3.4% gain comes from two sources&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;- The two sources are model size (+0.5%) and Noisy Student Training (+2.9%), which shows that Noisy Student Training has a larger impact than changing the model&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;- Compared with ResNeXt-101 WSL, which was trained on 3.5 billion labeled images, it needs less data and has comparatively fewer parameters&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;h4 data-ke-size=&quot;size20&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;Model Size Study&lt;/span&gt;&lt;/h4&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;- Even without iterative training, applying Noisy Student Training alone improved every model over its baseline&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;- This result suggests that other vision models can also benefit from Noisy Student Training (Figure 2)&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;742&quot; data-origin-height=&quot;746&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/du9trw/btsJBoQ34ik/DtuLcJbmXnODXHhb60Q4wK/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/du9trw/btsJBoQ34ik/DtuLcJbmXnODXHhb60Q4wK/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/du9trw/btsJBoQ34ik/DtuLcJbmXnODXHhb60Q4wK/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2Fdu9trw%2FbtsJBoQ34ik%2FDtuLcJbmXnODXHhb60Q4wK%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;583&quot; height=&quot;586&quot; data-origin-width=&quot;742&quot; data-origin-height=&quot;746&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
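&lt;h3 data-ke-size=&quot;size23&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;3.3. Robustness Results on ImageNet-A, ImageNet-C and ImageNet-P&lt;/span&gt;&lt;/h3&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; letter-spacing: 0px;&quot;&gt;- Robustness results (Tables 4, 5, 6)&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; letter-spacing: 0px;&quot;&gt;- ImageNet-A top-1 accuracy: increased from 61.0% to 83.7%&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; letter-spacing: 0px;&quot;&gt;- ImageNet-C mCE: decreased from 45.7% to 28.3%&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; letter-spacing: 0px;&quot;&gt;- ImageNet-P mFR: decreased from 27.8% to 12.2%&lt;/span&gt;&lt;/p&gt;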
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;730&quot; data-origin-height=&quot;471&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/bAVcB3/btsJARe9TIJ/SWKn3GlhBFJ2Kr4MF4WAzK/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/bAVcB3/btsJARe9TIJ/SWKn3GlhBFJ2Kr4MF4WAzK/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/bAVcB3/btsJARe9TIJ/SWKn3GlhBFJ2Kr4MF4WAzK/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FbAVcB3%2FbtsJARe9TIJ%2FSWKn3GlhBFJ2Kr4MF4WAzK%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;533&quot; height=&quot;344&quot; data-origin-width=&quot;730&quot; data-origin-height=&quot;471&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;734&quot; data-origin-height=&quot;495&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/cidHpJ/btsJCeUtyHp/4Y0DBCuy8Jirfm6cRTJHx1/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/cidHpJ/btsJCeUtyHp/4Y0DBCuy8Jirfm6cRTJHx1/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/cidHpJ/btsJCeUtyHp/4Y0DBCuy8Jirfm6cRTJHx1/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FcidHpJ%2FbtsJCeUtyHp%2F4Y0DBCuy8Jirfm6cRTJHx1%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;534&quot; height=&quot;360&quot; data-origin-width=&quot;734&quot; data-origin-height=&quot;495&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;735&quot; data-origin-height=&quot;505&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/cIUShT/btsJCy546Q3/DKUNxCdkKkIyYObnQCWDx0/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/cIUShT/btsJCy546Q3/DKUNxCdkKkIyYObnQCWDx0/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/cIUShT/btsJCy546Q3/DKUNxCdkKkIyYObnQCWDx0/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FcIUShT%2FbtsJCy546Q3%2FDKUNxCdkKkIyYObnQCWDx0%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;539&quot; height=&quot;370&quot; data-origin-width=&quot;735&quot; data-origin-height=&quot;505&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;1030&quot; data-origin-height=&quot;723&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/eB5BcF/btsJCxsy88v/lxUDbFYINGQTqmpHt3UrlK/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/eB5BcF/btsJCxsy88v/lxUDbFYINGQTqmpHt3UrlK/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/eB5BcF/btsJCxsy88v/lxUDbFYINGQTqmpHt3UrlK/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FeB5BcF%2FbtsJCxsy88v%2FlxUDbFYINGQTqmpHt3UrlK%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;757&quot; height=&quot;531&quot; data-origin-width=&quot;1030&quot; data-origin-height=&quot;723&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; letter-spacing: 0px; background-color: #c0d1e7;&quot;&gt;In the example images, the black text is the prediction from the model trained with the method proposed in this paper,&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; letter-spacing: 0px; background-color: #c0d1e7;&quot;&gt;and the red text is, according to the paper, the result from the model without Noisy Student Training.&lt;/span&gt;&lt;/p&gt;
&lt;h4 data-ke-size=&quot;size20&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; letter-spacing: 0px;&quot;&gt;Qualitative Analysis &lt;/span&gt;&lt;/h4&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; letter-spacing: 0px;&quot;&gt;- Visual analysis through example images&lt;/span&gt;&lt;/p&gt;
&lt;h3 data-ke-size=&quot;size23&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; letter-spacing: 0px;&quot;&gt;3.4. Adversarial Robustness Results &lt;/span&gt;&lt;/h3&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; letter-spacing: 0px;&quot;&gt;The model is shown to hold up well against deliberately crafted attacks (FGSM, PGD)&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;734&quot; data-origin-height=&quot;692&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/bRnwLk/btsJB5XNGlw/A8y0It676ZD4PlgPr1V7f1/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/bRnwLk/btsJB5XNGlw/A8y0It676ZD4PlgPr1V7f1/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/bRnwLk/btsJB5XNGlw/A8y0It676ZD4PlgPr1V7f1/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FbRnwLk%2FbtsJB5XNGlw%2FA8y0It676ZD4PlgPr1V7f1%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;470&quot; height=&quot;443&quot; data-origin-width=&quot;734&quot; data-origin-height=&quot;692&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
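&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; letter-spacing: 0px;&quot;&gt;To make the FGSM attack mentioned above concrete, here is a minimal sketch on a toy logistic-regression model. FGSM shifts every input coordinate by eps in the direction of the sign of the loss gradient; PGD repeats such steps and projects back into the eps-ball. The model, weights, and eps value below are illustrative assumptions, not the setup used in the paper.&lt;/span&gt;&lt;/p&gt;

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sign(v):
    # Returns -1.0, 0.0 or 1.0
    return math.copysign(1.0, v) if v != 0 else 0.0

def loss_and_input_grad(w, b, x, y):
    """Binary cross-entropy of a logistic model and its gradient w.r.t. the input x."""
    p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
    loss = -(y * math.log(p) + (1 - y) * math.log(1 - p))
    # For logistic regression, d(loss)/d(x_i) = (p - y) * w_i
    grad = [(p - y) * wi for wi in w]
    return loss, grad

def fgsm(w, b, x, y, eps):
    """One FGSM step: move each input coordinate by eps in the loss-increasing direction."""
    _, grad = loss_and_input_grad(w, b, x, y)
    return [xi + eps * sign(g) for xi, g in zip(x, grad)]

# Hypothetical toy model and input (assumptions for illustration only)
w, b = [2.0, -1.0], 0.0
x, y = [1.0, 0.5], 1

clean_loss, _ = loss_and_input_grad(w, b, x, y)
x_adv = fgsm(w, b, x, y, eps=0.1)
adv_loss, _ = loss_and_input_grad(w, b, x_adv, y)
print(x_adv, clean_loss, adv_loss)
```

&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; letter-spacing: 0px;&quot;&gt;The loss on the perturbed input is higher than on the clean input; a robust model is one whose accuracy degrades slowly as eps grows.&lt;/span&gt;&lt;/p&gt;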
&lt;hr contenteditable=&quot;false&quot; data-ke-type=&quot;horizontalRule&quot; data-ke-style=&quot;style6&quot; /&gt;
&lt;h2 data-ke-size=&quot;size26&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; letter-spacing: 0px;&quot;&gt;4. Ablation Study &lt;/span&gt;&lt;/h2&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; letter-spacing: 0px;&quot;&gt;- Analyzes which components affected model performance&lt;/span&gt;&lt;/p&gt;
&lt;h3 data-ke-size=&quot;size23&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; letter-spacing: 0px;&quot;&gt;4.1. The Importance of Noise in Self-training&lt;/span&gt;&lt;/h3&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; letter-spacing: 0px;&quot;&gt;- If the student model is trained on the teacher model's pseudo labels and ends up identical to the teacher, the cross-entropy loss on the unlabeled data would be expected to drop to zero and the training signal would vanish, which is why noise is needed&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; letter-spacing: 0px;&quot;&gt;- Removing the noise sources one at a time showed that accuracy drops compared with training with noise applied&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; letter-spacing: 0px;&quot;&gt;- Accuracy nevertheless stays high without noise, which is attributed to the large amount of unlabeled data and the effect of SGD&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; letter-spacing: 0px;&quot;&gt;- Adding noise to the teacher model instead lowers performance&lt;/span&gt;&lt;/p&gt;
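&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; letter-spacing: 0px;&quot;&gt;The vanishing training signal can be checked numerically. For softmax cross-entropy with a soft target q, the gradient with respect to the student's logits is p - q, where p is the student's predicted distribution. The sketch below uses made-up logits (an assumption for illustration) to show that the gradient is zero when the student reproduces the teacher exactly, and non-zero once student-side noise shifts the student's output.&lt;/span&gt;&lt;/p&gt;

```python
import math

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def ce_grad_wrt_logits(student_logits, soft_targets):
    """Gradient of cross-entropy(soft_targets, softmax(student_logits)) w.r.t. the logits: p - q."""
    p = softmax(student_logits)
    return [pi - qi for pi, qi in zip(p, soft_targets)]

# Hypothetical teacher logits for one unlabeled image
teacher_logits = [2.0, 0.5, -1.0]
q = softmax(teacher_logits)  # teacher's soft pseudo label

# Student identical to the teacher on the same clean input: nothing to learn
grad_same = ce_grad_wrt_logits(teacher_logits, q)

# Hypothetical logits after augmentation/dropout perturbs the student's forward pass
noisy_student_logits = [1.2, 0.9, -0.4]
grad_noisy = ce_grad_wrt_logits(noisy_student_logits, q)

print(max(abs(g) for g in grad_same))
print(max(abs(g) for g in grad_noisy))
```

&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; letter-spacing: 0px;&quot;&gt;With noise, the student has to match the teacher's clean prediction from a perturbed view, so the gradient stays informative.&lt;/span&gt;&lt;/p&gt;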
&lt;h3 data-ke-size=&quot;size23&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; letter-spacing: 0px;&quot;&gt;4.2. A Study of Iterative Training&lt;/span&gt;&lt;/h3&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; letter-spacing: 0px;&quot;&gt;- Train EfficientNet-B7 on the labeled data to create the teacher model&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; letter-spacing: 0px;&quot;&gt;- Train the student model (EfficientNet-L2) on the unlabeled data using the pseudo labels produced by the teacher model. Repeating this process as iterative training yielded good performance&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;730&quot; data-origin-height=&quot;742&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/dIst0I/btsJCdH67nx/f2S0X3dOorcKDSU1kRdKZ0/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/dIst0I/btsJCdH67nx/f2S0X3dOorcKDSU1kRdKZ0/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/dIst0I/btsJCdH67nx/f2S0X3dOorcKDSU1kRdKZ0/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FdIst0I%2FbtsJCdH67nx%2Ff2S0X3dOorcKDSU1kRdKZ0%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;469&quot; height=&quot;477&quot; data-origin-width=&quot;730&quot; data-origin-height=&quot;742&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;1113&quot; data-origin-height=&quot;475&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/bBvZyx/btsJB4dxdEN/o4QpwPOOMQtFsC7uji4Tv0/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/bBvZyx/btsJB4dxdEN/o4QpwPOOMQtFsC7uji4Tv0/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/bBvZyx/btsJB4dxdEN/o4QpwPOOMQtFsC7uji4Tv0/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FbBvZyx%2FbtsJB4dxdEN%2Fo4QpwPOOMQtFsC7uji4Tv0%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;555&quot; height=&quot;237&quot; data-origin-width=&quot;1113&quot; data-origin-height=&quot;475&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
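&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; letter-spacing: 0px;&quot;&gt;The iterative procedure in 4.2 can be sketched as a loop in which each student becomes the next teacher. The code below is a toy under stated assumptions: a 1-D nearest-centroid classifier stands in for EfficientNet, Gaussian input jitter stands in for the student-side noise, and all function names and data are hypothetical. It does not model the student being larger than the teacher.&lt;/span&gt;&lt;/p&gt;

```python
import random

def fit_centroids(points, labels):
    """Toy 'model': one mean per class for 1-D inputs."""
    sums, counts = {}, {}
    for x, y in zip(points, labels):
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}

def predict(model, x):
    """Return the class whose centroid is nearest to x."""
    return min(model, key=lambda y: abs(x - model[y]))

def noisy_student_rounds(labeled_x, labeled_y, unlabeled_x, rounds=3, noise=0.3, seed=0):
    rng = random.Random(seed)
    # Step 1: train the teacher on labeled data only
    teacher = fit_centroids(labeled_x, labeled_y)
    for _ in range(rounds):
        # Step 2: the teacher labels the clean (un-noised) unlabeled inputs
        pseudo_y = [predict(teacher, x) for x in unlabeled_x]
        # Step 3: the student trains on labeled data plus noised unlabeled inputs
        noised_x = [x + rng.gauss(0.0, noise) for x in unlabeled_x]
        student = fit_centroids(labeled_x + noised_x, labeled_y + pseudo_y)
        # Step 4: the student becomes the teacher for the next round
        teacher = student
    return teacher

labeled_x, labeled_y = [0.0, 1.0, 9.0, 10.0], [0, 0, 1, 1]
unlabeled_x = [0.5, 1.5, 2.0, 8.0, 8.5, 9.5]
model = noisy_student_rounds(labeled_x, labeled_y, unlabeled_x)
print(predict(model, 1.0), predict(model, 9.0))
```

&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; letter-spacing: 0px;&quot;&gt;Note that the teacher predicts on clean inputs while only the student sees noise, matching the ablation finding that noising the teacher hurts.&lt;/span&gt;&lt;/p&gt;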
&lt;h3 data-ke-size=&quot;size23&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; letter-spacing: 0px;&quot;&gt;4.3. Additional Ablation Study Summarization&lt;/span&gt;&lt;/h3&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; letter-spacing: 0px;&quot;&gt;- Summarizes the importance of the design choices used in Noisy Student Training so that readers can take away practical guidance&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; letter-spacing: 0px;&quot;&gt;Finding #1. Use a large teacher model&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; letter-spacing: 0px;&quot;&gt;Finding #2. A large amount of unlabeled data is needed&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; letter-spacing: 0px;&quot;&gt;Finding #3. In some cases, soft pseudo labels work better than hard ones on out-of-domain data&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; letter-spacing: 0px;&quot;&gt;Finding #4. A large student model is important&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; letter-spacing: 0px;&quot;&gt;Finding #5. Data balancing is important for small models&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; letter-spacing: 0px;&quot;&gt;Finding #6. Training on labeled and unlabeled data jointly works better than first training on unlabeled data and then fine-tuning&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; letter-spacing: 0px;&quot;&gt;Finding #7. Make the unlabeled batch size larger than the labeled batch size&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; letter-spacing: 0px;&quot;&gt;Finding #8. Training the student model's weights from scratch can sometimes be better than inheriting the teacher model's weights&lt;/span&gt;&lt;/p&gt;
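&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; letter-spacing: 0px;&quot;&gt;Two of the findings above are easier to see in code: the difference between soft and hard pseudo labels (Finding #3) and data balancing (Finding #5). This is a minimal sketch with hypothetical helper names and made-up values. The balancing rule shown here (keep the most confident images in over-full classes, repeat images in under-full ones) is an assumption about one reasonable way to balance, not a transcription of the authors' pipeline.&lt;/span&gt;&lt;/p&gt;

```python
def to_hard_label(probs):
    """Hard pseudo label: one-hot vector on the teacher's most likely class."""
    best = max(range(len(probs)), key=lambda i: probs[i])
    return [1.0 if i == best else 0.0 for i in range(len(probs))]

def balance(items_by_class, per_class):
    """Give every class exactly per_class items.

    items_by_class maps a class name to a list of (image_id, confidence) pairs.
    Over-full classes keep their most confident items; under-full classes
    repeat their items in confidence order until the quota is met.
    """
    balanced = {}
    for cls, items in items_by_class.items():
        ranked = sorted(items, key=lambda item: item[1], reverse=True)
        if ranked:
            balanced[cls] = [ranked[i % len(ranked)] for i in range(per_class)]
        else:
            balanced[cls] = []
    return balanced

# Hypothetical teacher output for one unlabeled image
teacher_probs = [0.55, 0.40, 0.05]
soft = teacher_probs                 # soft pseudo label: keep the full distribution
hard = to_hard_label(teacher_probs)  # hard pseudo label: discard the teacher's uncertainty

# Hypothetical pseudo-labeled pool: (image_id, teacher confidence)
pool = {"cat": [("a", 0.9), ("b", 0.5), ("c", 0.7)], "dog": [("d", 0.8)]}
balanced = balance(pool, per_class=2)
print(soft, hard)
print(balanced)
```

&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; letter-spacing: 0px;&quot;&gt;The soft label keeps the information that the second class was nearly as likely as the first, which the hard label throws away.&lt;/span&gt;&lt;/p&gt;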
&lt;hr contenteditable=&quot;false&quot; data-ke-type=&quot;horizontalRule&quot; data-ke-style=&quot;style6&quot; /&gt;
&lt;h2 data-ke-size=&quot;size26&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; letter-spacing: 0px;&quot;&gt;5. Related works&lt;/span&gt;&lt;/h2&gt;
&lt;h4 data-ke-size=&quot;size20&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; letter-spacing: 0px;&quot;&gt;Self-training &lt;/span&gt;&lt;/h4&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; letter-spacing: 0px;&quot;&gt;- Earlier self-training work either did not apply noise to the student model or did not fully establish the role of noise&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; letter-spacing: 0px;&quot;&gt;- This paper emphasizes the importance of noise and applies it aggressively to the student model&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; letter-spacing: 0px;&quot;&gt;Yalniz et al.: lower performance and robustness&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; letter-spacing: 0px;&quot;&gt;Parthasarathi et al.: trained on unlabeled data, but the student did not outperform the teacher&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; letter-spacing: 0px;&quot;&gt;Co-training: the models can make wrong predictions on the same unlabeled data&lt;/span&gt;&lt;/p&gt;
&lt;h4 data-ke-size=&quot;size20&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; letter-spacing: 0px;&quot;&gt;Semi-supervised Learning&lt;/span&gt;&lt;/h4&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; letter-spacing: 0px;&quot;&gt;- In methods based on consistency training, the model being trained generates its own pseudo labels, so it is regularized toward high-entropy predictions and struggles to reach good performance&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; letter-spacing: 0px;&quot;&gt;- These methods are hard to apply to large-scale datasets such as ImageNet&lt;/span&gt;&lt;/p&gt;
&lt;h4 data-ke-size=&quot;size20&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; letter-spacing: 0px;&quot;&gt;Knowledge Distillation&lt;/span&gt;&lt;/h4&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; letter-spacing: 0px;&quot;&gt;- Does not take unlabeled data into account, and the student model is small&lt;/span&gt;&lt;/p&gt;
&lt;h4 data-ke-size=&quot;size20&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; letter-spacing: 0px;&quot;&gt;Robustness &lt;/span&gt;&lt;/h4&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; letter-spacing: 0px;&quot;&gt;- Addressing the lack of robustness in vision models has been an important problem&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; letter-spacing: 0px;&quot;&gt;- Noisy Student Training improves robustness even though it does not optimize for robustness directly&lt;/span&gt;&lt;/p&gt;
&lt;hr contenteditable=&quot;false&quot; data-ke-type=&quot;horizontalRule&quot; data-ke-style=&quot;style6&quot; /&gt;
&lt;h2 data-ke-size=&quot;size26&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; letter-spacing: 0px;&quot;&gt;6. Conclusion&lt;/span&gt;&lt;/h2&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR'; letter-spacing: 0px;&quot;&gt;- Previous weakly-supervised learning work required billions of weakly labeled images, whereas this paper shows that unlabeled images can be used to substantially improve the accuracy and robustness of ImageNet models&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&amp;nbsp;&lt;/p&gt;</description>
      <category>Paper Review</category>
      <category>noisy</category>
      <category>Paper review</category>
      <category>self-training</category>
      <category>self-training with noisy student improves imagenet classification</category>
      <author>honey-vision</author>
      <guid isPermaLink="true">https://honey-vision.tistory.com/77</guid>
      <comments>https://honey-vision.tistory.com/77#entry77comment</comments>
      <pubDate>Mon, 12 Aug 2024 16:39:42 +0900</pubDate>
    </item>
  </channel>
</rss>